{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
    "\n",
    "<i>Licensed under the MIT License.</i>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Estimating Baseline Performance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is a baseline model? \n",
    "\n",
    "Producing a baseline model is crucial for evaluating your model's performance on any machine learning problem. A baseline model is a basic solution that serves as a point of reference for comparing other models to. The baseline model's performance gives us an indication of how much better our models can perform relative to a naive approach. \n",
    "\n",
    "Let's say we are building a sentence similarity model where our training set contains pairs of sentences and we want to predict how similiar these sentences are on a scale from 1-5. We could spend months producing a complex machine learning solution to this problem and ultimately get a mean squared error (MSE) of 0.3. But is this result good or bad? There is no way of knowing without comparing it with some baseline performance. For our baseline model, we could predict the mean sentence similarity of sentence pairs in our training set (called the _zero rule_) and get a MSE of 0.35. So our model is worse than the baseline which indicates that we may want to consider using different features, models, evaluation metrics, etc. It is crucial that the choice of baseline model be tailored to a data science problem based on buisness goals and the specific modeling task."
   ]
  },
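  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the zero rule concrete, here is a small worked example with made-up numbers (the three scores are illustrative assumptions, not values from any dataset). Suppose the training labels are $y = (1, 3, 5)$. The zero rule predicts the training mean for every sentence pair:\n",
    "\n",
    "$$\\hat{y} = \\bar{y} = \\frac{1 + 3 + 5}{3} = 3$$\n",
    "\n",
    "Evaluated on those same labels, its MSE is\n",
    "\n",
    "$$MSE = \\frac{(1-3)^2 + (3-3)^2 + (5-3)^2}{3} = \\frac{8}{3} \\approx 2.67$$\n",
    "\n",
    "which is simply the variance of the labels. A model is only worth its added complexity if its error is clearly below this number."
   ]
  },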
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What are good baselines for sentence similarity?\n",
    "\n",
    "For sentence similarity problems, we have two sub-tasks: 1) First, we need to produce a vector representation of each sentence in the sentence pair, known as an **embedding**. 2) Second, we need to compute the similarity between these two sentence embeddings.\n",
    "\n",
    "For producing representations of sentences, there are some common baseline approaches: \n",
    "1. Create word embeddings for each word in a sentence\n",
    "    1. word2vec word embeddings\n",
    "    2. GLoVe word embeddings\n",
    "    3. fastText word embeddings\n",
    "    \n",
    "2. Create sentence embeddings\n",
    "    1. doc2vec document embeddings\n",
    "    2. TF-IDF embeddings \n",
    "\n",
    "Then we have to compare our embeddings to calculate sentence similarity:\n",
    "1. Word Embedding comparison\n",
    "    1. Cosine Similarity (first requires averaging the word embeddings of all words in each sentence)\n",
    "    2. Word Mover's Distance\n",
    "\n",
    "2. Sentence Embedding comparison\n",
    "    1. Cosine Similarity  \n",
    "    \n",
    "The different embedding models and similarity metrics are introduced below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Data Loading and Preprocessing](#Data-Loading-and-Preprocessing)\n",
    "    * [Load STS Benchmark Dataset](#Load-STS-Benchmark-Dataset)\n",
    "    - [Preprocess / Tokenize](#Data-Preprocessing-/-Tokenization)\n",
    "    - [Document Frequency Calculation](#Document-Frequency-Calculation)\n",
    "* [Baseline Models](#Baseline-Models)\n",
    "    - [Baseline #1: word2vec and cosine similarity](#Baseline-#1:-Word2vec-Embeddings-with-Cosine-Similarity)\n",
    "    - [Baseline #2: word2vec and Word Mover's Distance](#Baseline-#2:-Word2vec-Embeddings-with-Word-Mover's-Distance)\n",
    "    - [Baseline #3: GloVe and cosine similarity](#Baseline-#3:-GloVe-Embeddings-with-Cosine-Similarity)\n",
    "    * [Baseline #4: GloVe and Word Mover's Distance](#Baseline-#4:-GloVe-Embeddings-with-Word-Mover's-Distance)\n",
    "    - [Baseline #5: fastText and cosine similarity](#Baseline-#5:-fastText-Embeddings-with-Cosine-Similarity)\n",
    "    - [Baseline #6: fastText and Word Mover's Distance](#Baseline-#6:-fastText-Embeddings-with-Word-Mover's-Distance)\n",
    "\n",
    "    * [Baseline #7: TF-IDF and cosine similarity](#Baseline-#7:-TF-IDF-Embeddings-with-Cosine-Similarity)\n",
    "    * [Baseline #8: Doc2vec and cosine similarity](#Baseline-#8:-Doc2vec-Embeddings-with-Cosine-Similarity))\n",
    "\n",
    "* [Comparison of Baseline Models](#Comparison-of-Baseline-Models)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reference running time \n",
    "The table below provide some reference running time of each section on CPU and GPU machines.  \n",
    "\n",
    "|Notebook Section|4 **CPU**s, 14GB memory VM| 1 NVIDIA Tesla K80 GPU, 12GB GPU memory VM|\n",
    "|:---------------|:------------------------:|:------------------------------------------:|\n",
    "|Whole notebook| ~ 35 mintues| ~ 28 minutes|\n",
    "|Data Loading and Preprocessing| ~ 8 minutes| ~ 6 minutes|\n",
    "|Baseline #1| ~ 4 minutes| ~ 3 minutes|\n",
    "|Baseline #2| ~ 5 seconds| ~ 3 seconds|\n",
    "|Baseline #3| ~ 18 minutes| ~ 14 minutes|\n",
    "|Baseline #4| ~ 5 seconds| ~ 5 seconds|\n",
    "|Baseline #5| Memory error, please skip if error occurs| ~ 3 minutes|\n",
    "|Baseline #6| Memory error, please skip if error occurs| ~ 5 seconds|\n",
    "|Baseline #7| ~ 6 seconds| ~ 5 seconds|\n",
    "|Baseline #8| ~ 3 minutes| ~ 2 minutes|\n",
    "|Comparison of Baseline Models| ~ 1 second| ~ 2 seconds|\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]\n",
      "Gensim version: 3.7.3\n"
     ]
    }
   ],
   "source": [
    "#Import Packages\n",
    "import sys\n",
    "# Set the environment path\n",
    "sys.path.append(\"../../\")  \n",
    "import os\n",
    "from collections import Counter\n",
    "import math\n",
    "import numpy as np\n",
    "from tempfile import TemporaryDirectory\n",
    "import scrapbook as sb\n",
    "import scipy\n",
    "from scipy.spatial import distance\n",
    "import gensim\n",
    "from gensim.models.doc2vec import LabeledSentence\n",
    "from gensim.models.doc2vec import TaggedDocument\n",
    "from gensim.models import Doc2Vec\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "#Import utility functions\n",
    "from utils_nlp.dataset.preprocess import to_lowercase, to_spacy_tokens\n",
    "from utils_nlp.dataset import stsbenchmark\n",
    "from utils_nlp.dataset.preprocess import (\n",
    "    to_lowercase,\n",
    "    to_spacy_tokens,\n",
    "    rm_spacy_stopwords,\n",
    ")\n",
    "from utils_nlp.models.pretrained_embeddings import word2vec\n",
    "from utils_nlp.models.pretrained_embeddings import glove\n",
    "from utils_nlp.models.pretrained_embeddings import fasttext\n",
    "\n",
    "print(\"System version: {}\".format(sys.version))\n",
    "print(\"Gensim version: {}\".format(gensim.__version__))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set the path where you datasets are located\n",
    "tmp_dir = TemporaryDirectory()\n",
    "BASE_DATA_PATH = tmp_dir.name "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Loading and Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load STS Benchmark Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we utilize the [STS Benchmark dataset](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#STS_benchmark_dataset_and_companion_dataset) which contains a selection of English datasets that were used in Semantic Textual Similarity (STS) tasks 2012-2017. The datasets include text from image captions, news headlines, and user forums. The dataset contains 8,628 sentence pairs with a human-labeled integer representing the sentences' similarity (ranging from 0, for no meaning overlap, to 5, meaning equivalence)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████████████████████████████████████████████████████████████████████████████| 401/401 [00:01<00:00, 247KB/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data downloaded to C:\\Users\\cocochra\\AppData\\Local\\Temp\\tmpp2a0cw_t\\raw\\stsbenchmark\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████████████████████████████████████████████████████████████████████████████| 401/401 [00:01<00:00, 243KB/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data downloaded to C:\\Users\\cocochra\\AppData\\Local\\Temp\\tmpp2a0cw_t\\raw\\stsbenchmark\n"
     ]
    }
   ],
   "source": [
    "# Produce a pandas dataframe for the training and test sets\n",
    "train_raw = stsbenchmark.load_pandas_df(BASE_DATA_PATH, file_split=\"train\")\n",
    "test_raw = stsbenchmark.load_pandas_df(BASE_DATA_PATH, file_split=\"test\")\n",
    "\n",
    "# Clean the sts dataset\n",
    "sts_train = stsbenchmark.clean_sts(train_raw)\n",
    "sts_test = stsbenchmark.clean_sts(test_raw)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training set has 5749 sentences\n",
      "Testing set has 1379 sentences\n"
     ]
    }
   ],
   "source": [
    "print(\"Training set has {} sentences\".format(len(sts_train)))\n",
    "print(\"Testing set has {} sentences\".format(len(sts_test)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>score</th>\n",
       "      <th>sentence1</th>\n",
       "      <th>sentence2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.500</td>\n",
       "      <td>A girl is styling her hair.</td>\n",
       "      <td>A girl is brushing her hair.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3.600</td>\n",
       "      <td>A group of men play soccer on the beach.</td>\n",
       "      <td>A group of boys are playing soccer on the beach.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5.000</td>\n",
       "      <td>One woman is measuring another woman's ankle.</td>\n",
       "      <td>A woman measures another woman's ankle.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.200</td>\n",
       "      <td>A man is cutting up a cucumber.</td>\n",
       "      <td>A man is slicing a cucumber.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1.500</td>\n",
       "      <td>A man is playing a harp.</td>\n",
       "      <td>A man is playing a keyboard.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1.800</td>\n",
       "      <td>A woman is cutting onions.</td>\n",
       "      <td>A woman is cutting tofu.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>3.500</td>\n",
       "      <td>A man is riding an electric bicycle.</td>\n",
       "      <td>A man is riding a bicycle.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2.200</td>\n",
       "      <td>A man is playing the drums.</td>\n",
       "      <td>A man is playing the guitar.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2.200</td>\n",
       "      <td>A man is playing guitar.</td>\n",
       "      <td>A lady is playing the guitar.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>1.714</td>\n",
       "      <td>A man is playing a guitar.</td>\n",
       "      <td>A man is playing a trumpet.</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   score                                      sentence1  \\\n",
       "0  2.500                    A girl is styling her hair.   \n",
       "1  3.600       A group of men play soccer on the beach.   \n",
       "2  5.000  One woman is measuring another woman's ankle.   \n",
       "3  4.200                A man is cutting up a cucumber.   \n",
       "4  1.500                       A man is playing a harp.   \n",
       "5  1.800                     A woman is cutting onions.   \n",
       "6  3.500           A man is riding an electric bicycle.   \n",
       "7  2.200                    A man is playing the drums.   \n",
       "8  2.200                       A man is playing guitar.   \n",
       "9  1.714                     A man is playing a guitar.   \n",
       "\n",
       "                                          sentence2  \n",
       "0                      A girl is brushing her hair.  \n",
       "1  A group of boys are playing soccer on the beach.  \n",
       "2           A woman measures another woman's ankle.  \n",
       "3                      A man is slicing a cucumber.  \n",
       "4                      A man is playing a keyboard.  \n",
       "5                          A woman is cutting tofu.  \n",
       "6                        A man is riding a bicycle.  \n",
       "7                      A man is playing the guitar.  \n",
       "8                     A lady is playing the guitar.  \n",
       "9                       A man is playing a trumpet.  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sts_test.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Preprocessing / Tokenization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our baseline models will expect that each sentence is represented by a list of **tokens**. Tokens are linguistic units like words, punctuation marks, numbers, etc. We'll use our util functions which utilize the spaCy package, a popular package for performing tokenization."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's also common to remove high-frequency words which do not help distinguish one sentence from another, so called **stop words**. For example, \"the\", \"and\", \"a\", etc. are typical stop words although each tokenization package may differ in the words they consider to be stop words. We'll tokenize our corpus with and without stop words so that we can compare our methods with and without stop words."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Training Set Preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert all text to lowercase\n",
    "df_low = to_lowercase(sts_train)  \n",
    "# Tokenize text\n",
    "sts_tokenize = to_spacy_tokens(df_low) \n",
    "# Tokenize with removal of stopwords\n",
    "sts_train_stop = rm_spacy_stopwords(sts_tokenize) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now each row in our dataframe contains:  \n",
    "- The similarity score of the sentence pair\n",
    "- The 2 original sentences from our datasets  \n",
    "- A column for each sentence's tokenization with stop words  \n",
    "- A column for each sentence's tokenization without stop words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>score</th>\n",
       "      <th>sentence1</th>\n",
       "      <th>sentence2</th>\n",
       "      <th>sentence1_tokens</th>\n",
       "      <th>sentence2_tokens</th>\n",
       "      <th>sentence1_tokens_rm_stopwords</th>\n",
       "      <th>sentence2_tokens_rm_stopwords</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.00</td>\n",
       "      <td>a plane is taking off.</td>\n",
       "      <td>an air plane is taking off.</td>\n",
       "      <td>[a, plane, is, taking, off, .]</td>\n",
       "      <td>[an, air, plane, is, taking, off, .]</td>\n",
       "      <td>[plane, taking, .]</td>\n",
       "      <td>[air, plane, taking, .]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3.80</td>\n",
       "      <td>a man is playing a large flute.</td>\n",
       "      <td>a man is playing a flute.</td>\n",
       "      <td>[a, man, is, playing, a, large, flute, .]</td>\n",
       "      <td>[a, man, is, playing, a, flute, .]</td>\n",
       "      <td>[man, playing, large, flute, .]</td>\n",
       "      <td>[man, playing, flute, .]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3.80</td>\n",
       "      <td>a man is spreading shreded cheese on a pizza.</td>\n",
       "      <td>a man is spreading shredded cheese on an uncoo...</td>\n",
       "      <td>[a, man, is, spreading, shreded, cheese, on, a...</td>\n",
       "      <td>[a, man, is, spreading, shredded, cheese, on, ...</td>\n",
       "      <td>[man, spreading, shreded, cheese, pizza, .]</td>\n",
       "      <td>[man, spreading, shredded, cheese, uncooked, p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2.60</td>\n",
       "      <td>three men are playing chess.</td>\n",
       "      <td>two men are playing chess.</td>\n",
       "      <td>[three, men, are, playing, chess, .]</td>\n",
       "      <td>[two, men, are, playing, chess, .]</td>\n",
       "      <td>[men, playing, chess, .]</td>\n",
       "      <td>[men, playing, chess, .]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.25</td>\n",
       "      <td>a man is playing the cello.</td>\n",
       "      <td>a man seated is playing the cello.</td>\n",
       "      <td>[a, man, is, playing, the, cello, .]</td>\n",
       "      <td>[a, man, seated, is, playing, the, cello, .]</td>\n",
       "      <td>[man, playing, cello, .]</td>\n",
       "      <td>[man, seated, playing, cello, .]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   score                                      sentence1  \\\n",
       "0   5.00                         a plane is taking off.   \n",
       "1   3.80                a man is playing a large flute.   \n",
       "2   3.80  a man is spreading shreded cheese on a pizza.   \n",
       "3   2.60                   three men are playing chess.   \n",
       "4   4.25                    a man is playing the cello.   \n",
       "\n",
       "                                           sentence2  \\\n",
       "0                        an air plane is taking off.   \n",
       "1                          a man is playing a flute.   \n",
       "2  a man is spreading shredded cheese on an uncoo...   \n",
       "3                         two men are playing chess.   \n",
       "4                 a man seated is playing the cello.   \n",
       "\n",
       "                                    sentence1_tokens  \\\n",
       "0                     [a, plane, is, taking, off, .]   \n",
       "1          [a, man, is, playing, a, large, flute, .]   \n",
       "2  [a, man, is, spreading, shreded, cheese, on, a...   \n",
       "3               [three, men, are, playing, chess, .]   \n",
       "4               [a, man, is, playing, the, cello, .]   \n",
       "\n",
       "                                    sentence2_tokens  \\\n",
       "0               [an, air, plane, is, taking, off, .]   \n",
       "1                 [a, man, is, playing, a, flute, .]   \n",
       "2  [a, man, is, spreading, shredded, cheese, on, ...   \n",
       "3                 [two, men, are, playing, chess, .]   \n",
       "4       [a, man, seated, is, playing, the, cello, .]   \n",
       "\n",
       "                 sentence1_tokens_rm_stopwords  \\\n",
       "0                           [plane, taking, .]   \n",
       "1              [man, playing, large, flute, .]   \n",
       "2  [man, spreading, shreded, cheese, pizza, .]   \n",
       "3                     [men, playing, chess, .]   \n",
       "4                     [man, playing, cello, .]   \n",
       "\n",
       "                       sentence2_tokens_rm_stopwords  \n",
       "0                            [air, plane, taking, .]  \n",
       "1                           [man, playing, flute, .]  \n",
       "2  [man, spreading, shredded, cheese, uncooked, p...  \n",
       "3                           [men, playing, chess, .]  \n",
       "4                   [man, seated, playing, cello, .]  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sts_train_stop.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Test Set Preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert all text to lowercase\n",
    "df_low = to_lowercase(sts_test)\n",
    "# Tokenize text\n",
    "sts_tokenize = to_spacy_tokens(df_low)\n",
    "# Tokenize with removal of stopwords\n",
    "sts_test_stop = rm_spacy_stopwords(sts_tokenize)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Document Frequency Calculation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many baseline models we explore will require calculation of how frequently a word appears in our corpus. To calculate this, we iterate through the sentences in our training set and count the number of sentences that contain each word. There are other ways to produce this calculation, including pulling larger datasets from the web (like Wikipedia data) and calculating the frequencies on that data. Note that \"document\" refers to some larger chunk of multiple tokens/words. In our case, our documents will actually be individual sentences. "
   ]
  },
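  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration (the two token lists are made up for this example), take a corpus of two sentences, `[a, man, a, plan]` and `[a, dog]`. Each token is counted at most once per sentence, so the repeated `a` in the first sentence contributes 1 rather than 2:\n",
    "\n",
    "$$df_{\\text{a}} = 2, \\quad df_{\\text{man}} = df_{\\text{plan}} = df_{\\text{dog}} = 1, \\quad N = 2$$"
   ]
  },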
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_document_frequency(df):\n",
    "    \"\"\"Iterate through all sentences in dataframe and create a dictionary \n",
    "    mapping tokens to the number of sentences in our corpus they appear in\n",
    "    \n",
    "    Args:\n",
    "        df (pandas dataframe): dataframe of sentence pairs with their similarity scores\n",
    "        \n",
    "    Returns:\n",
    "        document_frequency_dict (dictionary): mapping from tokens to number of sentences they appear in\n",
    "        n (int): number of sentences in the corpus\n",
    "    \"\"\"\n",
    "    document_frequency_dict = {}\n",
    "    all_sentences =  df[[\"sentence1_tokens\", \"sentence2_tokens\"]]\n",
    "    sentences = all_sentences.values.flatten().tolist()\n",
    "    n = len(sentences)\n",
    "\n",
    "    for s in sentences:\n",
    "        for token in set(s):\n",
    "            document_frequency_dict[token] = document_frequency_dict.get(token, 0) + 1\n",
    "\n",
    "    return document_frequency_dict, n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note that we need to calculate these values on our training set so that we don't \"peek at\" our test set until test time\n",
    "document_frequencies, num_documents = get_document_frequency(sts_train_stop)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "11498"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "num_documents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Baseline Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we consider each of the baseline models, we'll save all model predictions in a dictionary and will evaluate the results at the end of this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "baselines = {}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Baseline #1: Word2vec Embeddings with Cosine Similarity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This baseline first constructs word embeddings using word2vec. Once we have a word embedding (vector) for each word in the sentence, we calculate an embedding for the full sentence by taking the (weighted) average of all the word embeddings. The weights will be calculated using TF-IDF. Lastly, in order to compare the two sentence embeddings we use the cosine similarity metric. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is Word2Vec?\n",
    "Word2vec is a predictive model for learning word embeddings from text (see [original research paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)). Word embeddings are learned such that words that share common contexts in the corpus will be close together in the vector space. There are two different model architectures that can be used to produce word2vec embeddings: continuous bag-of-words (CBOW) or continuous skip-gram. The former uses a window of surrounding words (the \"context\") to predict the current word and the latter uses the current word to predict the surrounding context words. See this [tutorial](https://www.guru99.com/word-embedding-word2vec.html#3) on word2vec for more detailed information about the model.\n",
    "\n",
    "For our purposes, we use pretrained word2vec word embeddings. These embeddings were trained on a Google News corpus and provide 300-dimensional embeddings (vectors) for 3 million English words. See this [link](https://code.google.com/archive/p/word2vec/) for the original location of the embeddings and see the code below to load these word embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|████████████████████████████████████████████████████████████████████████████| 1.61M/1.61M [01:08<00:00, 23.4kKB/s]\n",
      "C:\\Users\\cocochra\\AppData\\Local\\Continuum\\anaconda3\\envs\\nlp_gpu\\lib\\site-packages\\smart_open\\smart_open_lib.py:398: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function\n",
      "  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL\n"
     ]
    }
   ],
   "source": [
    "word2vec_model = word2vec.load_pretrained_vectors(dir_path=BASE_DATA_PATH)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is TF-IDF?\n",
    "\n",
    "TF-IDF or term frequency-inverse document frequency is a weighting scheme intended to measure how important a word is to the document (or sentence in our case) within the broader corpus (our dataset). The weight \"increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus\" ([tutorial link](http://www.tfidf.com/)). When we're averaging together many different word vectors to get a sentence embedding, it makes sense to give stronger weights to words that are more distinct relative to the corpus and that have a high frequency in the sentence. The TF-IDF weights capture this intution, with the weight increasing as term frequency increases and/or as the inverse document frequency increases.\n",
    "\n",
    "For a term $t$ in sentence $s$ in corpus $c$, then the TF-IDF weight is \n",
    "$$w_{t,s} = TF_{t,s} * \\log{\\frac{N}{df_t}}$$\n",
    "where:  \n",
    "$TF_{t,s}$ = the number of times term $t$ appears in sentence $s$  \n",
    "$df_t$ = the number of sentences containing term $t$  \n",
    "$N$ = the size of the corpus.  \n",
    "\n",
    "In these baselines, we calculate the TF-IDF weighted average of all the word embeddings. The code below implements this weighted average given a list of tokens and an embedding model."
   ]
  },
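  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick worked example may help show what these weights do. The numbers are hypothetical and chosen only for easy arithmetic: assume a corpus of $N = 1000$ sentences in which \"the\" appears in 500 sentences and \"cello\" in 10, and a sentence that contains each word once. Using the natural logarithm (as `math.log` does in the code):\n",
    "\n",
    "$$w_{\\text{the}} = 1 \\cdot \\ln{\\frac{1000}{500}} = \\ln 2 \\approx 0.69$$\n",
    "\n",
    "$$w_{\\text{cello}} = 1 \\cdot \\ln{\\frac{1000}{10}} = \\ln 100 \\approx 4.61$$\n",
    "\n",
    "If these were the only two words in the sentence, the normalized weights would be about 0.13 for \"the\" and 0.87 for \"cello\", so the rarer, more informative word dominates the sentence embedding."
   ]
  },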
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "def average_sentence_embedding(tokens, embedding_model):\n",
    "    \"\"\"Calculate TF-IDF weighted average embedding for a sentence\n",
    "    \n",
    "    Args:\n",
    "        tokens (list): list of tokens in a sentence\n",
    "        embedding_model: model to use for word embedding (word2vec, glove, fastText, etc.)\n",
    "    \n",
    "    Returns:\n",
    "        list: vector representing the sentence\n",
    "    \"\"\"\n",
    "    # Throw away tokens that are not in the embedding model\n",
    "    tokens = [i for i in tokens if i in embedding_model]\n",
    "\n",
    "    if len(tokens) == 0:\n",
    "        return []\n",
    "\n",
    "    # We will weight by TF-IDF. The TF part is calculated by:\n",
    "    # (# of times term appears / total terms in sentence)\n",
    "    count = Counter(tokens)\n",
    "    token_list = list(count)\n",
    "    term_frequency = [count[i] / len(tokens) for i in token_list]\n",
    "\n",
    "    # Now for the IDF part: log(# documents / (# documents containing the term + 1)).\n",
    "    # The +1 avoids a division by zero for terms with no recorded document frequency.\n",
    "    inv_doc_frequency = [\n",
    "        math.log(num_documents / (document_frequencies.get(i, 0) + 1)) for i in token_list\n",
    "    ]\n",
    "\n",
    "    # Put the TF-IDF together and produce the weighted average of vector embeddings\n",
    "    word_embeddings = [embedding_model[token] for token in token_list]\n",
    "    weights = [tf * idf for tf, idf in zip(term_frequency, inv_doc_frequency)]\n",
    "    return list(np.average(word_embeddings, weights=weights, axis=0))"
   ]
  },
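  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the weighting concrete, here is a small worked example with made-up numbers (the corpus size and document frequencies are illustrative assumptions, not values from the STS data). It follows `average_sentence_embedding` above, which divides the term frequency by the sentence length and adds 1 to the document frequency.\n",
    "\n",
    "Suppose the corpus has $N = 1000$ sentences and the sentence is `[\"cat\", \"sat\", \"cat\"]`, with $df_{\\text{cat}} = 9$ and $df_{\\text{sat}} = 99$. Then\n",
    "\n",
    "$$w_{\\text{cat}} = \\frac{2}{3} \\ln\\frac{1000}{9 + 1} = \\frac{2}{3} \\ln 100 \\approx 3.07, \\qquad w_{\\text{sat}} = \\frac{1}{3} \\ln\\frac{1000}{99 + 1} = \\frac{1}{3} \\ln 10 \\approx 0.77$$\n",
    "\n",
    "`np.average` divides by the sum of the weights, so the sentence embedding is\n",
    "\n",
    "$$\\vec{s} = \\frac{w_{\\text{cat}} \\vec{v}_{\\text{cat}} + w_{\\text{sat}} \\vec{v}_{\\text{sat}}}{w_{\\text{cat}} + w_{\\text{sat}}} = 0.8 \\, \\vec{v}_{\\text{cat}} + 0.2 \\, \\vec{v}_{\\text{sat}}$$\n",
    "\n",
    "The rarer, repeated word dominates the average, which is the behavior the weighting is meant to produce."
   ]
  },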
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is Cosine Similarity?\n",
    "\n",
    "Cosine similarity is a common similarity metric between vectors. It is the cosine of the angle between two vectors: 1 when they point in the same direction, 0 when they are orthogonal, and -1 when they point in opposite directions. For vectors $a$ and $b$, the cosine similarity is: cosine similarity($a$,$b$) = $\\frac{\\vec{a} \\cdot \\vec{b} }{\\lVert\\vec{a}\\rVert \\lVert\\vec{b}\\rVert}$\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calculate_cosine_similarity(embedding1, embedding2):\n",
    "    \"\"\"Calculate cosine similarity between two embedding vectors\n",
    "    \n",
    "    Args:\n",
    "        embedding1 (list): embedding for the first sentence\n",
    "        embedding2 (list): embedding for the second sentence\n",
    "    \n",
    "    Returns:\n",
    "        float: cosine similarity value between the two embeddings\n",
    "    \"\"\"\n",
    "    # distance.cosine calculates cosine DISTANCE, so we need to\n",
    "    # return 1 - distance to get cosine similarity\n",
    "    cosine_similarity = 1 - distance.cosine(embedding1, embedding2)\n",
    "    return cosine_similarity"
   ]
  },
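  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick worked example (the vectors are made up for illustration), take $\\vec{a} = (1, 2, 2)$ and $\\vec{b} = (2, 1, 2)$. Then\n",
    "\n",
    "$$\\vec{a} \\cdot \\vec{b} = 2 + 2 + 4 = 8, \\qquad \\lVert\\vec{a}\\rVert = \\lVert\\vec{b}\\rVert = \\sqrt{9} = 3$$\n",
    "\n",
    "so the cosine similarity is\n",
    "\n",
    "$$\\frac{8}{3 \\times 3} = \\frac{8}{9} \\approx 0.889$$\n",
    "\n",
    "The cosine distance is 1 minus this value, $\\frac{1}{9} \\approx 0.111$, which is why `calculate_cosine_similarity` subtracts the distance from 1."
   ]
  },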
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get Sentence Similarity Predictions for Test Set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we calculate predictions for each sentence pair found in the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "def average_word_embedding_cosine_similarity(df, embedding_model, rm_stopwords=False):\n",
    "    \"\"\"Calculate the cosine similarity between TF-IDF weighted averaged embeddings\n",
    "    \n",
    "    Args:\n",
    "        df (pandas dataframe): dataframe as provided by the nlp_utils\n",
    "        embedding_model: word embedding model\n",
    "        rm_stopwords (bool): whether to remove stop words (True) or not (False)\n",
    "    \n",
    "    Returns:\n",
    "        list: predicted values for sentence similarity of test set examples\n",
    "    \"\"\"\n",
    "    if rm_stopwords:\n",
    "        df['sentence1_embedding'] = df.apply(lambda x: average_sentence_embedding(x.sentence1_tokens_rm_stopwords, embedding_model), axis=1)\n",
    "        df['sentence2_embedding'] = df.apply(lambda x: average_sentence_embedding(x.sentence2_tokens_rm_stopwords, embedding_model), axis=1)\n",
    "    else:\n",
    "        df['sentence1_embedding'] = df.apply(lambda x: average_sentence_embedding(x.sentence1_tokens, embedding_model), axis=1)\n",
    "        df['sentence2_embedding'] = df.apply(lambda x: average_sentence_embedding(x.sentence2_tokens, embedding_model), axis=1)\n",
    "\n",
    "    df['predictions'] = df.apply(lambda x: calculate_cosine_similarity(x.sentence1_embedding, x.sentence2_embedding) if \n",
    "                                 (sum(x.sentence1_embedding) != 0 and sum(x.sentence2_embedding) != 0) else 0, axis=1)\n",
    "    \n",
    "    return df['predictions'].tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get predictions using average word2vec embeddings both with and without stop words\n",
    "baselines[\"Word2vec Cosine\"] = average_word_embedding_cosine_similarity(\n",
    "    sts_test_stop, word2vec_model, rm_stopwords=True\n",
    ")\n",
    "baselines[\"Word2vec Cosine with Stop Words\"] = average_word_embedding_cosine_similarity(\n",
    "    sts_test_stop, word2vec_model, rm_stopwords=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Baseline #2: Word2vec Embeddings with Word Mover's Distance "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This baseline first constructs word embeddings using word2vec (for an introduction to word2vec, see [Background on Word2Vec](#What-is-Word2Vec?)). Then all the word embeddings are used to calculate sentence similarity using the word mover's distance.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is Word Mover's Distance (WMD)?\n",
    "Word Mover's Distance (WMD) is a metric that \"adapts the earth mover’s distance to the space of documents: the distance between two texts is given by the total amount of “mass” needed to move the words from one side into the other, multiplied by the distance the words need to move.\" We use the `wmdistance` method of the gensim embedding model. Because WMD is a distance (smaller values mean more similar sentences), the code below negates it so that larger predictions correspond to more similar sentences. See this [blog](http://vene.ro/blog/word-movers-distance-in-python.html) for additional information about this similarity measure."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get Sentence Similarity Predictions for Test Set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we calculate predictions for each sentence pair found in the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "def word_embedding_WMD(df, embedding_model, rm_stopwords=False):\n",
    "    \"\"\"Calculate Word Mover's Distance between two sentences using embeddings\n",
    "    \n",
    "    Args:\n",
    "        df (pandas dataframe): dataframe as provided by the nlp_utils\n",
    "        embedding_model (gensim model): word embedding model\n",
    "        rm_stopwords (bool): whether to remove stop words (True) or not (False)\n",
    "    \n",
    "    Returns:\n",
    "        list: predicted values for sentence similarity of test set examples\n",
    "    \"\"\"\n",
    "    if rm_stopwords:\n",
    "        df['sentence1_cleaned'] = df.apply(lambda x: [i for i in x.sentence1_tokens_rm_stopwords if i in embedding_model], axis=1)\n",
    "        df['sentence2_cleaned'] = df.apply(lambda x: [i for i in x.sentence2_tokens_rm_stopwords if i in embedding_model], axis=1)\n",
    "    else:\n",
    "        df['sentence1_cleaned'] = df.apply(lambda x: [i for i in x.sentence1_tokens if i in embedding_model], axis=1)\n",
    "        df['sentence2_cleaned'] = df.apply(lambda x: [i for i in x.sentence2_tokens if i in embedding_model], axis=1)\n",
    "\n",
    "    # wmdistance takes the raw tokens and looks up their embeddings itself.\n",
    "    # WMD is a distance, so negate it to get a score that grows with similarity.\n",
    "    df['predictions'] = df.apply(lambda x: -embedding_model.wmdistance(x.sentence1_cleaned, x.sentence2_cleaned) if \n",
    "                                 (len(x.sentence1_cleaned) != 0 and len(x.sentence2_cleaned) != 0) else 0, axis=1)\n",
    "    \n",
    "    return df['predictions'].tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get predictions using word2vec embeddings and WMD both with and without stop words\n",
    "baselines[\"Word2vec WMD\"] = word_embedding_WMD(sts_test_stop, word2vec_model, rm_stopwords=True)\n",
    "baselines[\"Word2vec WMD with Stop Words\"] = word_embedding_WMD(\n",
    "    sts_test_stop, word2vec_model, rm_stopwords=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Baseline #3: GloVe Embeddings with Cosine Similarity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This baseline first constructs word embeddings using GloVe. Once we have a word embedding (vector) for each word in the sentence, we calculate an embedding for the full sentence by taking the (weighted) average of all the word embeddings. The weights are calculated using TF-IDF. Lastly, to compare the two sentence embeddings we use the cosine similarity metric (for an introduction to the cosine similarity metric, see [Background on Cosine Similarity](#What-is-Cosine-Similarity?)). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is GloVe?\n",
    "GloVe is an unsupervised algorithm for obtaining word embeddings, created by the Stanford NLP group (see [original research paper](https://nlp.stanford.edu/pubs/glove.pdf)). Training uses word-word co-occurrence statistics, with the objective of learning word embeddings such that the dot product of two words' embeddings equals the logarithm of the words' probability of co-occurrence. See this [tutorial](https://nlp.stanford.edu/projects/glove/) on GloVe for more detailed background on the model. For our purposes, we use pretrained GloVe word embeddings (glove.840B.300d.zip, which can be found through the above link). These embeddings were trained on Common Crawl data and provide 300-dimensional embeddings (vectors) for 2.2 million English words. Below is the code to load the GloVe embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "glove_model = glove.load_pretrained_vectors(dir_path=BASE_DATA_PATH)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get predictions using GloVe embeddings and cosine similarity both with and without stop words\n",
    "baselines[\"GLoVe Cosine\"] = average_word_embedding_cosine_similarity(\n",
    "    sts_test_stop, glove_model, rm_stopwords=True\n",
    ")\n",
    "baselines[\"GLoVe Cosine with Stop Words\"] = average_word_embedding_cosine_similarity(\n",
    "    sts_test_stop, glove_model, rm_stopwords=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Baseline #4: GloVe Embeddings with Word Mover's Distance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This baseline first constructs word embeddings using GloVe (for an introduction to GloVe, see [Background on GloVe](#What-is-GloVe?)). Then all the word embeddings are used to calculate sentence similarity using the word mover's distance (for an introduction to WMD, see [Background on Word Mover's Distance](#What-is-Word-Mover's-Distance-(WMD)?)).  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get predictions using GloVe embeddings and WMD both with and without stop words\n",
    "baselines[\"GLoVe WMD\"] = word_embedding_WMD(sts_test_stop, glove_model, rm_stopwords=True)\n",
    "baselines[\"GLoVe WMD with Stop Words\"] = word_embedding_WMD(\n",
    "    sts_test_stop, glove_model, rm_stopwords=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Baseline #5: fastText Embeddings with Cosine Similarity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This baseline first constructs word embeddings using fastText. Once we have a word embedding (vector) for each word in the sentence, we calculate an embedding for the full sentence by taking the (weighted) average of all the word embeddings. The weights will be calculated using TF-IDF. Lastly, in order to compare the two sentence embeddings we use the cosine similarity metric (for an introduction to the cosine similarity metric, see [Background on Cosine Similarity](#What-is-Cosine-Similarity?)). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is fastText?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "fastText is an unsupervised algorithm created by Facebook Research for efficiently learning word embeddings (see [original research paper](https://arxiv.org/pdf/1607.04606.pdf)). fastText differs significantly from word2vec and GloVe: those two algorithms treat each word as the smallest unit to find an embedding for, whereas fastText represents a word as a collection of character n-grams (e.g. the 2-grams of the word \"language\" are {la, an, ng, gu, ua, ag, ge}). The embedding for a word is then the sum of the embeddings of its character n-grams. This is an advantage for rare words and for words not present in the dictionary, as these words can still be broken down into character n-grams. For smaller datasets, fastText often performs better than word2vec or GloVe. See this [tutorial](https://fasttext.cc/docs/en/unsupervised-tutorial.html) on fastText for more detail. We use the pretrained word embeddings for the English language (wiki.en.bin; these embeddings as well as embeddings for 156 other languages can be found [here](https://fasttext.cc/docs/en/english-vectors.html)). These are 300-dimensional embeddings (vectors) trained on Wikipedia data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "fastText_model = fasttext.load_pretrained_vectors(dest_path=BASE_DATA_PATH)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get predictions using fastText embeddings and cosine similarity both with and without stop words\n",
    "baselines[\"fastText Cosine\"] = average_word_embedding_cosine_similarity(\n",
    "    sts_test_stop, fastText_model, rm_stopwords=True\n",
    ")\n",
    "baselines[\"fastText Cosine with Stop Words\"] = average_word_embedding_cosine_similarity(\n",
    "    sts_test_stop, fastText_model, rm_stopwords=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Baseline #6: fastText Embeddings with Word Mover's Distance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get predictions using fastText embeddings and WMD both with and without stop words\n",
    "baselines[\"fastText WMD\"] = word_embedding_WMD(sts_test_stop, fastText_model, rm_stopwords=True)\n",
    "baselines[\"fastText WMD with Stop Words\"] = word_embedding_WMD(\n",
    "    sts_test_stop, fastText_model, rm_stopwords=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Baseline #7: TF-IDF Embeddings with Cosine Similarity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This baseline first constructs a document embedding based on bag-of-words with TF-IDF weighting (for an introduction to TF-IDF, see [Background on TF-IDF](#What-is-TF-IDF?)). Then we apply cosine similarity between the two embeddings in the sentence pair (for an introduction to the cosine similarity metric, see [Background on Cosine Similarity](#What-is-Cosine-Similarity?))."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bag-of-Words\n",
    "\n",
    "The most basic approach for document embeddings is called Bag-of-Words. This method first determines the vocabulary across the entire corpus and then, for each document, creates a vector containing the number of times each vocabulary word appeared in the given document. These vectors are obviously very sparse and typical bag-of-words implementations ignore terms whose document frequency is less than some threshold in order to reduce sparsity. We also often ignore stop words as they add little semantic information. "
   ]
  },
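  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration (a made-up two-sentence corpus, not taken from the STS data), consider the sentences \"the cat sat\" and \"the dog sat\". The vocabulary is (cat, dog, sat, the), so the count vectors are\n",
    "\n",
    "$$\\vec{a} = (1, 0, 1, 1), \\qquad \\vec{b} = (0, 1, 1, 1)$$\n",
    "\n",
    "and their cosine similarity is\n",
    "\n",
    "$$\\frac{\\vec{a} \\cdot \\vec{b}}{\\lVert\\vec{a}\\rVert \\lVert\\vec{b}\\rVert} = \\frac{2}{\\sqrt{3} \\sqrt{3}} = \\frac{2}{3}$$\n",
    "\n",
    "TF-IDF weighting rescales each count before this comparison, down-weighting words such as \"the\" and \"sat\" that appear in every sentence. The exact values depend on the smoothing and normalization options of the vectorizer, so they are not reproduced here."
   ]
  },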
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "def tfidf_cosine_similarity(df, rm_stopwords=False):\n",
    "    \"\"\"Calculate cosine similarity between TF-IDF document embeddings\n",
    "    \n",
    "    Args:\n",
    "        df (pandas dataframe): dataframe as provided by the nlp_utils\n",
    "        rm_stopwords (bool): whether to remove stop words (True) or not (False)\n",
    "    \n",
    "    Returns:\n",
    "        list: predicted values for sentence similarity of test set examples\n",
    "    \"\"\"\n",
    "    stop_word_param = \"english\" if rm_stopwords else None\n",
    "\n",
    "    # min_df=1 keeps every term (recent scikit-learn versions reject an integer min_df of 0)\n",
    "    tf = TfidfVectorizer(\n",
    "        input=\"content\",\n",
    "        analyzer=\"word\",\n",
    "        min_df=1,\n",
    "        stop_words=stop_word_param,\n",
    "        sublinear_tf=True,\n",
    "    )\n",
    "    # Stack all first sentences followed by all second sentences\n",
    "    corpus = np.concatenate([df[\"sentence1\"].values, df[\"sentence2\"].values])\n",
    "    tfidf_matrix = tf.fit_transform(corpus).toarray()\n",
    "    num_samples = len(df.index)\n",
    "    \n",
    "    # calculate the cosine similarity between pairs of tfidf embeddings\n",
    "    # first pair at index 0 and n in tfidf_matrix, second pair at 1 and n+1, etc.\n",
    "    df[\"predictions\"] = df.apply(\n",
    "        lambda x: calculate_cosine_similarity(\n",
    "            tfidf_matrix[int(x.name), :], tfidf_matrix[num_samples + int(x.name), :]\n",
    "        )\n",
    "        if (\n",
    "            sum(tfidf_matrix[int(x.name), :]) != 0\n",
    "            and sum(tfidf_matrix[num_samples + int(x.name), :]) != 0\n",
    "        )\n",
    "        else 0,\n",
    "        axis=1,\n",
    "    )\n",
    "    return df[\"predictions\"].tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "baselines[\"TF-IDF Cosine\"] = tfidf_cosine_similarity(sts_test_stop, rm_stopwords=True)\n",
    "baselines[\"TF-IDF Cosine with Stop Words\"] = tfidf_cosine_similarity(\n",
    "    sts_test_stop, rm_stopwords=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Baseline #8: Doc2vec Embeddings with Cosine Similarity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This baseline constructs document embeddings using doc2vec and then applies cosine similarity to measure each sentence pair's similarity (for an introduction to the cosine similarity metric, see [Background on Cosine Similarity](#What-is-Cosine-Similarity?))."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is Doc2Vec?\n",
    "\n",
    "Doc2vec is an extension of word2vec that produces embeddings of a document. Note that \"document\" refers to some larger chunk of multiple tokens/words; in our case, the documents are individual sentences. The algorithm not only exploits the idea of context words (as in word2vec), but also incorporates the context of the document. There are again two model architectures that parallel those of word2vec: Paragraph Vectors Distributed Memory (PV-DM) and Paragraph Vectors Distributed Bag-of-Words (PV-DBOW). PV-DM randomly samples consecutive words in a paragraph and predicts a center word by utilizing the context words and the paragraph id. PV-DBOW takes a paragraph id and uses it to predict words in the context. \n",
    "\n",
    "See [tutorial #1](https://kanoki.org/2019/03/07/sentence-similarity-in-python-using-doc2vec/) or [tutorial #2](https://gab41.lab41.org/doc2vec-to-assess-semantic-similarity-in-source-code-667acb3e62d7) for more information and an example of using Doc2vec for sentence similarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Doc2vec requires unique ids for each sentence, so we'll iterate \n",
    "# through our dataframe, adding a new ID column\n",
    "\n",
    "all_sentences = sts_test_stop[[\"sentence1\", \"sentence2\"]]\n",
    "corpus = all_sentences.values.flatten().tolist()\n",
    "# Produce dictionary of sentence to id\n",
    "sentence_id = {sent: i for i, sent in enumerate(set(corpus))}\n",
    "\n",
    "def assign_id(row):\n",
    "    return sentence_id[row]\n",
    "\n",
    "sts_test_stop[\"qid1\"] = sts_test_stop[\"sentence1\"].apply(assign_id)\n",
    "sts_test_stop[\"qid2\"] = sts_test_stop[\"sentence2\"].apply(assign_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "def doc2vec_cosine(df, rm_stopwords=False):\n",
    "    \"\"\"Calculate cosine similarity between each sentence pair using Doc2Vec embeddings\n",
    "    \n",
    "    Args:\n",
    "        df (pandas dataframe): dataframe as provided by the nlp_utils\n",
    "        rm_stopwords (bool): whether to remove stop words (True) or not (False)\n",
    "    \n",
    "    Returns:\n",
    "        list: predicted values for sentence similarity of test set examples\n",
    "    \"\"\"\n",
    "    if rm_stopwords:\n",
    "        df[[\"sentence1_prepped\", \"sentence2_prepped\"]] = df[\n",
    "            [\"sentence1_tokens_rm_stopwords\", \"sentence2_tokens_rm_stopwords\"]\n",
    "        ]\n",
    "    else:\n",
    "        df[[\"sentence1_prepped\", \"sentence2_prepped\"]] = df[\n",
    "            [\"sentence1_tokens\", \"sentence2_tokens\"]\n",
    "        ]\n",
    "\n",
    "    # Doc2vec requires data as Tagged Documents with the tokenized sentence and a list of tags\n",
    "    # (here a single tag, the sentence id); a bare string would be treated as a sequence of\n",
    "    # single-character tags\n",
    "    df[\"labeled_questions1\"] = df.apply(\n",
    "        lambda x: TaggedDocument(x.sentence1_prepped, [str(x.qid1)]), axis=1\n",
    "    )\n",
    "    df[\"labeled_questions2\"] = df.apply(\n",
    "        lambda x: TaggedDocument(x.sentence2_prepped, [str(x.qid2)]), axis=1\n",
    "    )\n",
    "\n",
    "    # Get all Tagged Documents\n",
    "    df_labeled_sentences = df[[\"labeled_questions1\", \"labeled_questions2\"]]\n",
    "    labeled_sentences = df_labeled_sentences.values.flatten().tolist()\n",
    "\n",
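    "    # Instantiate the Doc2Vec model. Passing the corpus to the constructor builds\n",
    "    # the vocabulary and already trains for `epochs` passes.\n",
    "    model = Doc2Vec(\n",
    "        labeled_sentences, dm=1, min_count=1, window=5, vector_size=500, epochs=30\n",
    "    )\n",
    "\n",
    "    # Continue training: 20 further rounds of model.epochs (30) passes each\n",
    "    for epoch in range(20):\n",
    "        model.train(\n",
    "            labeled_sentences, epochs=model.epochs, total_examples=model.corpus_count\n",
    "        )\n",
    "\n",
    "    # n_similarity compares the averaged word vectors of the two token lists\n",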
    "    df[\"predictions\"] = df.apply(\n",
    "        lambda x: model.wv.n_similarity(x.sentence1_prepped, x.sentence2_prepped)\n",
    "        if (len(x.sentence1_prepped) != 0 and len(x.sentence2_prepped) != 0)\n",
    "        else 0,\n",
    "        axis=1,\n",
    "    )\n",
    "\n",
    "    return df[\"predictions\"].tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "baselines[\"Doc2vec Cosine\"] = doc2vec_cosine(sts_test_stop, rm_stopwords=True)\n",
    "baselines[\"Doc2vec Cosine with Stop Words\"] = doc2vec_cosine(sts_test_stop, rm_stopwords=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Comparison of Baseline Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our evaluation metric is Pearson correlation ($\\rho$) which is a measure of the linear correlation between two variables. The formula for calculating Pearson correlation is as follows:  \n",
    "\n",
    "$$\\rho_{X,Y} = \\frac{E[(X-\\mu_X)(Y-\\mu_Y)]}{\\sigma_X \\sigma_Y}$$\n",
    "\n",
    "This metric takes a value in [-1,1] where -1 represents a perfect negative correlation, 1 represents a perfect positive correlation, and 0 represents no correlation. We use the Pearson correlation metric because it is the metric that [SentEval](http://nlpprogress.com/english/semantic_textual_similarity.html), a widely used toolkit for evaluating sentence representations, uses for the STS Benchmark dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pearson_correlation(df, prediction):\n",
    "    \"\"\"Calculate the Pearson correlation between two vectors\n",
    "    \n",
    "    Args:\n",
    "        df (pandas dataframe): dataframe of sentences and their similarity scores\n",
    "        prediction (list): predicted similarity scores for each value in test set\n",
    "        \n",
    "    Returns:\n",
    "        float: pearson correlation value between the actual and predicted score lists\n",
    "    \"\"\"\n",
    "    return scipy.stats.pearsonr(prediction, list(df[\"score\"]))[0]"
   ]
  },
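  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A short worked example may help (the numbers are made up for illustration). Take predictions $X = (0.1, 0.4, 0.7)$ and gold scores $Y = (1, 2, 5)$, so $\\mu_X = 0.4$ and $\\mu_Y = \\frac{8}{3}$. The deviations from the means are $(-0.3, 0, 0.3)$ and $(-\\frac{5}{3}, -\\frac{2}{3}, \\frac{7}{3})$, which gives\n",
    "\n",
    "$$\\rho_{X,Y} = \\frac{(-0.3)(-\\frac{5}{3}) + 0 + (0.3)(\\frac{7}{3})}{\\sqrt{0.18} \\sqrt{\\frac{78}{9}}} = \\frac{1.2}{\\sqrt{1.56}} \\approx 0.961$$\n",
    "\n",
    "Multiplying $X$ by a positive constant or adding a constant to it leaves $\\rho$ unchanged. This is why cosine similarities and negated WMD values can be scored directly against the gold similarity scores without first mapping them onto the same scale."
   ]
  },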
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Word2vec Cosine': 0.6476606845766778,\n",
       " 'Word2vec Cosine with Stop Words': 0.6683808069062863,\n",
       " 'Word2vec WMD': 0.6574175839579567,\n",
       " 'Word2vec WMD with Stop Words': 0.5689438215886101,\n",
       " 'GLoVe Cosine': 0.6688056947022161,\n",
       " 'GLoVe Cosine with Stop Words': 0.6049380247374541,\n",
       " 'GLoVe WMD': 0.6267300417407605,\n",
       " 'GLoVe WMD with Stop Words': 0.48470008225931194,\n",
       " 'fastText Cosine': 0.6707510007525627,\n",
       " 'fastText Cosine with Stop Words': 0.6771300330824099,\n",
       " 'fastText WMD': 0.6394958913339955,\n",
       " 'fastText WMD with Stop Words': 0.5177829727556036,\n",
       " 'TF-IDF Cosine': 0.6749213786510483,\n",
       " 'TF-IDF Cosine with Stop Words': 0.7118087132257667,\n",
       " 'Doc2vec Cosine': 0.528387685928394,\n",
       " 'Doc2vec Cosine with Stop Words': 0.45572884639905675}"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get metrics on predictions from all models\n",
    "results = {model: pearson_correlation(sts_test_stop, baselines[model]) for model in baselines}\n",
    "results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We evaluate our 8 baselines with and without stop words (16 results in total). TF-IDF bag-of-words document embeddings combined with cosine similarity perform best, with a Pearson correlation of about 0.71 when stop words are kept. Doc2vec and the WMD variants that keep stop words perform worst."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAg8AAAEWCAYAAADhFHRsAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjAsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+17YcXAAAgAElEQVR4nOzdd7xcVbn/8c+XgLRQpBpUiHSBQIAQIAYEwYYNBQVEpKiIXOGCgvITr2C74AXFAsLlqvSOYgGUDgkpBBJCEkqQEqkivQWQJM/vj/UMZ2cy55yZ03JO8n2/XueVmV3WXnvNwF6z9trPo4jAzMzMrFlLLOwKmJmZ2cDizoOZmZm1xJ0HMzMza4k7D2ZmZtYSdx7MzMysJe48mJmZWUvceTCzHiXpR5KekfTPhV2X/kDS+yT9XdIrknZf2PUBkHSApFsr71+RtO7CrFMrJIWk9ZvYbidJj/VFnRY37jyYLeYkzZL0Wl5AnpJ0lqTBXSzr3cA3gU0i4h09W9MB6wfAqRExOCL+WL+yrv2fl3RVtmOfybo91NPlSro5L/Rb1C3/Yy7fqaePaX3DnQczA/hERAwGtgK2Ab7bagGSlgTWAZ6NiH91cf9F0TrA3Z1sU2v/IcBTwK96vVZ9537gi7U3klYFtgOeXmg1sm5z58HM3hIRjwN/BTYDkLSSpN9KelLS43lLYlCuO0DSOEmnSHoOuBm4Dlgrf0Wfndt9UtLdkl7IX6LvrR0vf3V/W9I04FVJS+ayoyVNk/RqHn9NSX+V9LKk6yW9vVLGZZL+KelFSWMkbVpZd7ak0/LX/MuSbpO0XmX9ppKuk/Rcjrp8J5cvIekYSQ9KelbSpZJWaa/dJH1F0gNZzp8lrZXLHwTWBf6SbbJ0J+3/OnA5sEml7I9JulPSS5IelXR8Zd0yks7POr4g6XZJa3b22TWo/1u3AZpos40rbTZT0uc6OifgAmCvyrH3Aa4A/l0pc2lJP5f0RP79vNpW+X14MtcdVFf3pSWdLOmR/AzPkLRsO+f57WyLl7Puu3RSd2uHOw9m9pYcLt8NuDMXnQPMAdYHtgQ+BHy5ssu2wEPAGsAHgY8CT+Qw+AGSNgQuAo4AVgeuplxI31YpYx/gY8DKETEnl+2R5W0IfILSofkOsBrl/1uHV/b/K7BB1mEK5WJVtQ/wfeDtwAPAj/NcVwCuB/4GrJXneEPucziwO/D+XPc8cFo7bfYB4ATgc5SRg38AFwNExHrAI+TIQkS80aiMSlnLAXsBEyuLX6X8cl852+lraps7sT+wEvBuYFXgEOC1XNfZZ9eR9tpseUoH8UJKe+8D/LraYWvgCeCePD55LufWbXMsZTRiOLAFMJIc/ZL0EeAoyvdhA2DXun1/QvmeDM9zfSfwvfpKSNoI+DqwTUSsAHwYmNVBva0jEeE///lvMf6j/A/0FeAFyoXv18CywJrAG8CylW33AW7K1wcAj9SVtRPwWOX9fwGXVt4vATwO7FQ59kEN6rNv5f3vgdMr7w8D/tjOuawMBLBSvj8b+E1l/W7AfZVzubOdcu4Fdqm8HwK8CSzZYNvfAv9TeT84tx1aOZ9dm2z/OZSL7bAOtv85cEq+PggYD2xet00zn92tlXUBrN9Em+0FjK071v8Cx7VT15spHZYvUDqRGwH357rHKt+DB4HdKvt9GJiVr38HnFhZt2GtvoAonav1Kuu3Bx6u/z7m9v+idD6WWtj/3Q30v0X1HqOZtWb3iLi+ukDSMGAp4ElJtcVLAI9WNqu+bmQtSocEgIiYJ+lRyq/Djsp4qvL6tQbvB2cdB1F+FX+WMrIxL7dZDXgxX1ef+phd25fya/3Bduq9DnCFpHmVZXMpF+XH67ZdizLiAUBEvCLpWco5zmqn/Hq7R8T1eT6fAm6RtElE/FPStsCJlFtJbwOWBi7L/c7L87hY0srA+ZRf8evQ+WfXkfbabB1gW0kvVNYv
mfXoyB+AnwLPtrPtfN+TfL1WZd3kunU1qwPLAZMr5ylggdszEfGApCOA44FNJV0DfCMinuik7taAb1uYWXsepfx6XS0iVs6/FSOiOkTdWVreJygXHABU/g//bua/AHcnte/nKRfbXSnD90Nrh2pi30eB9TpY99HKea8cEctEmRNSr/4cl6fcQmi0bYciYm5E/IHSURmdiy8E/gy8OyJWAs4gzy8i3oyI70fEJsAo4OOU2wLNfHZd8ShwS127DI6Ir3VyXrMpt5e+RuPOw3xtCKydywCepHxnqutqnqF0Jjet1GelKJNPG9XjwogYnccKyi0P6wJ3HsysoYh4ErgW+KmkFXMS4XqS3t9CMZcCH5O0i6SlKI9xvkEZau8JK2R5z1J+gf53C/teCbxD0hE56W6F/JUP5QL9Y0nrAEhaXdKn2innQuBAScNzkt9/A7dFxKxWT0bFpyhzDe7NxSsAz0XE65JGUjpMte13ljQsRyxeotwumdtDn10jVwIbStpP0lL5t40qk2A78B3g/e20y0XAd7OdV6PMWTg/110KHCBpk5wTclxtp4iYB/wfcIqkNQAkvVPSh+sPIGkjSR/Iz+h1SqdjbrMnbvNz58HMOvJFylD5PZRJg5dT7v83JSJmUu53/4ryK/ETlMmD/+5wx+adSxnGfjzrOLHjzeer28uUSXifoAzT/x3YOVf/gvJr/1pJL2e527ZTzg2UuR2/p/xKXg/Yu8Xz+IukVygdgB8D+0dE7fHOQ4EfZD2+R7mY1ryD8pm8ROls3ELbRbdbn10j2WYfopzfE5R2+wnlVkpn+z4REbe2s/pHwB3ANGA65TbQj3K/v1LmedxImbx5Y92+387lEyW9RJkEu1GDYyxNuf3zTNZ7DUqHxrpAEd0ZMTQzM7PFjUcezMzMrCXuPJiZmVlL3HkwMzOzlrjzYGZmZi1xkChbLKy22moxdOjQhV0NM7MBZfLkyc9ExOr1y915sMXC0KFDueOOOxZ2NczMBhRJ/2i03LctzMzMrCXuPJiZmVlL3HkwMzOzlrjzYGZmZi1x58HMzMxa4s6DmZmZtcSdBzMzM2uJOw9mZmbWEgeJssXC9MdfZOgxVy3sapjZYmzWiR9b2FXoMR55MDMzs5a489DHJJ0i6YjK+2sk/aby/qeSvtGN8o+XdFS+PknSfZKmSbpC0srdq32nxz4qjzdD0l2SvtiFMg7pyn5mZtZ33Hnoe+OBUQCSlgBWAzatrB8FjGumIEmDOtnkOmCziNgcuB/4fy3XtkmSDgE+CIyMiM2AHQG1Wk5EnBER5/Z0/czMrOe489D3xpGdB0qnYQbwsqS3S1oaeC9wp4qT8lf8dEl7AUjaSdJNki4EpueyYyXNlHQ9sFHtQBFxbUTMybcTgXfl9rdJeqvDIulmSVtLWl7S7yTdLulOSZ/K9YMknZz1mCbpsAbn9R3g0Ih4KY/9YkSck/vvkuVNz/KXzuUnSronyzw5l1VHTm6W9BNJkyTdL2mHSn1OynpOk/TV7n0kZmbWCk+Y7GMR8YSkOZLWpnQiJgDvBLYHXgSmRcS/Je0BDAe2oIxO3C5pTBYzkjKi8LCkrYG9gS0pn+cUYHKDQx8EXJKvLwY+BxwnaQiwVkRMlvTfwI0RcVDe4piUHZIvAu8BtoyIOZJWqRYsaQVghYh4sP6gkpYBzgZ2iYj7JZ0LfC3//TSwcUREB7dUloyIkZJ2A44DdgW+BLwYEdtkR2ScpGsj4uG6Yx8MHAwwaMUFMsqamVkXeeRh4aiNPtQ6DxMq78fnNqOBiyJibkQ8BdwCbJPrJlUulDsAV0TE7PzV/+f6g0k6FpgDXJCLLgU+m68/B1yWrz8EHCNpKnAzsAywNuWCfUZtFCMinqs/BBDtnOtGwMMRcX++P4dyS+Ml4HXgN5I+A8xuZ/8/5L+TgaGVen4x63kbsCqwQf2OEXFmRIyIiBGDllupneLNzKxVHnlYOGrzHoZRbls8CnyTckH9XW7T0XyB
V+vet3fhRtL+wMcpv/wDICIel/SspM2BvYDasL+APSJiZl0ZHXUOiIiXJL0qad2IeKi+Cu3sM0fSSGAXysjJ14EPNNj0jfx3Lm3fVwGHRcQ17dXJzMx6j0ceFo5xlAv6czmy8BywMuXWxYTcZgywV97fX53ya31Sg7LGAJ+WtGzePvhEbYWkjwDfBj4ZEfW/7C8GvgWsFBHTc9k1wGHZWUDSlrn8WuAQSUvm8lVY0AnAaZJWzG1WzNsG9wFDJa2f2+0H3CJpcB77auAIyi2aZl1DufWxVB5rQ0nLt7C/mZl1g0ceFo7plHkMF9YtGxwRz+T7Kyidibsov/q/FRH/lLRxtaCImCLpEmAq8A9gbGX1qcDSwHXZH5gYEYfkusuBXwA/rGz/Q+DnwLTsQMyidHJ+A2yYy98E/i/LrjodGEyZm/Em8Cbw04h4XdKBwGXZ+bgdOANYBfhTzokQcGSnrdbmN5RbGFOynk8Du7ewv5mZdYNyJNtskTZixIi44447FnY1zMwGFEmTI2JE/XLftjAzM7OWuPNgZmZmLfGcB1ssODGWmQ0EAyV5lkcezMzMrCUddh7kJE6dldFnSZwkXS1p5fw7tLJ8J0lXNrH/dhmWeqqkeyUdX9l/VCe7N1vHKyTtXnk/U9J3K+9/nwGhulr+2ZL27G49zcysezobeXASpw70ZRKniNgtIl6gxIM4tLPtGzgHODgihgObUaJMAuxEW66N7qp+X1YFXqE8blqzPW0RNDtUiylhZmb9T2edBydx6oMkTpK+JenwfH2KpBsrdTk/X8+StBpwIrBejiCclEUMlnR5jqRckLEP6q0BPJnnOzci7pE0FDgEODLL20HSOpJuyLreoJKDo/ar/wxJY/P8Pt7gGNXvyyjgSmD1/H68B3gtY1UsI+msbOM7Je2cxzhA0mWS/gJcm/udmu1+VZ5Drc0W+DzMzKxvdPjrzkmc+iyJ0xhKeOpfAiOApVWiJ45m/qBPAMdkew7POu+U7bkp8ATlAv4+4Na6/U4BZkq6GfgbcE5EzJJ0BvBKRNQ6RH8Bzo2IcyQdlHWq3YoYCrwfWA+4SdL6EfF65RiTgc0kvY3yfbkFWJfSydyStlGq/wCIiGEqQa+ulbRhrtse2DwinlO5xbERJYz3msA9wO/yM+3085ATY5mZ9YpmJkw6iVPvJ3GaDGydHZs3KG08gtJe9Z2HRiZFxGMRMY8SaXJo/QYR8YMs81rg85QORCPb0xb58jzKZ1tzaUTMi4i/Aw8B9dEu3wDuBrYCtsvzbe/7cl7ucx8lMmat83Bd5TPbkbbv1RPAjbm8qc/DibHMzHpHM52H+iROEykXmOp8h55O4rRvNYkTUE3idHHlmHtExPD8Wzsi7qXjzgHZaXlV0rqNqtDOPnMoIyi/p/wKb+/C21ESp1o93xMR19aV/yYlFPSBlPYeC+xM+YV/b3vn0uC49ceuP48HI+J0SjKqLVTmJXQm2nnd6D2U+u9IGd15nvJ9qXUeeuT70sLnYWZmvaDZkQcncer9JE5jgKPy37GUuQhTa52oipeBFVo4Pnncj1XmQmxA6WS80KC88ZRbSwD7Mv/tj89KWkLSepTbEfNl30zjKFk678r30yijEGtTRiWgnOO+Wa8Nc12jssYAe+f3agilQ0U3Pw8zM+umZma0O4lT3yRxGgscC0yIiFclvU6DWxYR8aykcZJmAH8Fmo18tB9wiqTZlNtC+0bE3JzjcLnKhNPDgMMp8wqOzroeWCljJuWW1JrAIXXzHWrGUzoWJ2R950j6F/Bo3lYB+DVwhqTpWZcDIuINLTjP8wpKmu7plCdwbsnlK9D1z8PMzLrJibGsKZLOBq6MiMsXdl26womxzMxaJyfGMjMzs57gQDzWlIg4YGHXoTuc28LM+puBkseiEY88mJmZWUvceehHtAjmEpG0Rca4qL3fR9LsytMnwyRNy9c3S3qk8lQIkv4o6ZV8PVTSaypRKe9Viea5f2/U28zM2ufO
Q/+yKOYSmQ6sk4/mQjmH+ygRJ2vvq+f0AiVCJtmhGVJX3oMRsWVEvJfySOmR+YSMmZn1EXce+pdFLpdIPp55O7BtLtoaOI35c2BUk2VdTFucic/QFrVzARHxEPANyuOlZmbWR9x56EcyBHN9LpHbKDE0RpC5RCgX1VoukV2BkzKIEpTIi8dGxCaaP5fIZ2gLGV7vIErMCGjLJYIquUQoMShujIhtKMGaTspgVwfTlktkc9rCileNB0bl9vMo4cSrnYfqyMMNwI45crI3bTlO2jOFujDZNZIOlnSHpDvmzn6xk2LMzKxZ7jz0P4taLpHqOY0Ebs+kZOurRCMdnCMINXMpUS33ApaNiFntN1U5hfZWOLeFmVnv8KOa/U99LpFHKRk3XwJ+l9v0dC6RXaq5RCRVc4nUUojXconMrCujw1wiaSKlczOatpDmj1FGFsY32P5iSnTJ4zspF8qoSjP5P8zMrId45KH/WeRyiUTEy5RO0AGVc5hAyUvRqPMwlhLe+qIG694iaShwMvCrjrYzM7Oe5c5D/1PLJTKxbtmLdblEplFyidxI5hKpLygiplDmDEylZKCszyWyAiWXyFRJZ1TWXU4ZFbi0suyHwFKUnCEzaMsz8hvgkVx+FyXddyPjgKUj4tF8P4GSA2OBzkMUJ1fOt2q92qOaWb9fRcRZ7RzTzMx6gXNb2GLBuS3MzFrn3BZmZmbWI9x5MDMzs5b4aQtbLDgxltniaSAnn+rPPPJgZmZmLely58FJnPouiZOktSRdnq+HS9qtsu6tduqkjIMqIaRnVMJLHyBpra7Uq658SXpG0tvz/RBJIWl0ZZunJa3ajWPMkrRad+tqZmbd052RBydx6qMkThHxRETsmW+HA7t1tH09Se+ihJcenW24HeVRTyixF7rdecggU7VQ2lDa6k7aviMbAc9ExLNN1tm31MzM+qnudB6cxKmHkjhJujojOpL1/V6+/qGkL+coxgxJbwN+QAkQNbXWlsAmee4PSWqUJGoN4GXglazLKxHxsKQ9KTkzLsjylpW0S9Zherbh0lmXWZJ+kiMokySt3+A41e/EKOBnzN+ZGJ9lrSPphvwMblDJ5YGksyX9TNJNwE8krSrp2qzP/5KRNfPzvUrSXdkue2FmZn2my50HJ3ECei6J0xhgB0krUvJMvC+Xj6YS2Cnb83vAJRExPCJqx9sY+DClPY9T3mapuAt4CnhY0lmSPpHlXQ7cAewbEcMpYabPBvaKiGGUCbVfq5TzUkSMpASY+nmD83hrNCrr8kfg3fm+2nanAudWPoNfVsrYENg1Ir4JHAfcGhFbUvJyrJ3bfAR4IiK2iIjNgL81qIsTY5mZ9ZLuTph0EqeeSeI0lhJiejRwFTBY0nLA0PpcEu24KiLeyIiM/wLWrK6MiLmUC+6elNs+p0g6vkE5GwEPR8T9+f6crFfNRZV/t2dBk4Ats+O1VES8AjyUoxTV78T2wIX5+rw875rLsr7ksc/Pc7gKeD6XTwd2zZGQHSKiYc/AibHMzHpHdzsP9UmcJlIuDNVfmT2dxGnfahInoJrE6eLKMffIX+fDI2LtiLg3l/dGEqdfMX8o5/a0l8TpdspozQ6UUYg7ga8Ak5soE+CNyuu5NHgEN0M+T4qIEyjnskeDcjr6rGD+tlugHTNHxgOU0aEpuXgiZY7GGkB7HaFqWZ1+J7JzszWlE3FC7TaPmZn1jZ4YeXASp24mccrbEY9SRk8mZplHMX8uipqXKTkpmqbytMZWlUXDgX80KO8+YGhlPsN+lJGimr0q/06gsXGUtqq23X8CE2udPko71uaK7EsZvWlkTK5H0keB2pMcawGzI+J8Sptu1c7+ZmbWC7rbeXASp55L4jQWeCo7R2Mpk0IbdR5uokyQrE6Y7MxSwMkqj7tOpVz8/zPXnQ2ckcsFHAhcJmk6Zc5Hta2XlnRb7ntkO8caR2mrWudhSp5Lte0OBw5Ueex1v0pd6n2fMqdkCuVW1CO5fBgwKet8
LPCjjk/fzMx6khNjWVMkzQJGtNNJ6vecGMvMrHVyYiwzMzPrCQ7EY02JiKELuw7d4dwWZjYQDJRcHB55MDMzs5a48zCASVpT0oUZWXKypAmSPp3rdpJ0ZZPlHC/phLplw3OiZyv1OSonZc7I6I9fbGX/LOOQruxnZmZ9x52HASofQ/0jMCYi1o2IWoTOd3WhuItoewyzZm/aAjk1U59DgA8CIzPq4450HjdiARFxRkSc2+p+ZmbWd9x5GLg+APw7It56lDIi/hERC8SRqGqUuyKjWL4gadvKpp8jg25J+lCOakyRdJmkwQ2K/g5waEYHJSJejIhz2jtmLj9R0j2Z4+LkXFbNpnpzJZ/G/ZJ2yOWDVPKl3J77frVrTWhmZl3hzsPAtSltURybImkZ2s9dcREZuEnSdsCzEfF3lRTY36Xkm9iKkgvjG3XlrgCskKG8mzpmBuj6NLBp5rhoL1bDkplP4whKrguAL1FiiWxDiQb6FUnvaXBs57YwM+sF7jwsIiSdlvMMbu9gs45yV1wM7KmSXn1v2iJmbgdsAozLoEz7A+vUH572w363d8yXgNeB30j6DFAfObSmlql0MjA0X38I+GLW5zZgVWCD+h2d28LMrHf4Uc2B624q+Ski4j9ylKCjSEjtzkGIiEczENT7s9ztK/tcFxH7dLDvS5JelbRuXeKwdo8ZEXMkjQR2oXRWvk65FVOvlrejmrNDwGERcU17dTIzs97jkYeB60ZgGUnVlNnLdbJPZ7krLgJOAR6MiMdy2UTgfbV9JC0nacMGZZ8AnKaSVhxJK0o6uL1j5ryJlSLiasotieGdn/JbrqHc+lgqj7WhSiZPMzPrAx55GKAiIiTtTkmv/S3gaUpGym9XNttF0mOV95+lLXfFkpRsntXcFZcBvwAOqxznaUkHABfVJjpS5kDcz/xOBwYDt0t6E3gT+GlEvC6p0TFXAf6UcyJE+7kyGvkN5RbGlHzq5Glg9xb2NzOzbnBuC1ssOLeFmVnrnNvCzMzMeoQ7D2ZmZtYSz3mwxYITY5lZVwyURFV9zSMPZmZm1pI+6zw4iVP3SLpa0sr5d2hleVNtJ2k7SbdJmirpXknHV/Yf1UN1vCKfAKm9nynpu5X3v8+AUF0t/2xJe3a3nmZm1j190nlwEqfui4jdIuIFYGXg0M62b+Ac4OCIGA5sBlyay3cCeqTzAIyvlSVpVeAV2oJNka/HN1NQPtZpZmb9UF+NPDiJUwdJnCR9S9Lh+foUSTdW6nJ+vp6VESRPBNbLEYSTsojBki7PkZQLsrNWbw3gyTzfuRFxj6ShwCHAkVneDpLWkXRD1vUGSWvn8c+WdIaksXl+H29wjHG0dURGAVcCq6t4D/BaRPxT0jKSzso2vlPSznmMA/Iz+wtwbe53arb7VXkOtTZb4PMwM7O+0VedBydx6jiJ0xhgh3w9gtIZWAoYDYyt2/YYSgTI4RFxdC7bMo+5CbAu8L4GdTsFmJm3Fr4qaZmImEUJ2HRKljcWOBU4N8/zAuCXlTKGUsJXfww4I9urajKwmaS3UToPE4CZwHvz/bjc7j8Aso33Ac6plLU9sH9EfIDS5hsBw4Cv0Daq0dTnISfGMjPrFQtlwqScxKk+idNkYOvs2LxBueiOoHQo6jsPjUyKiMciYh4wtXLst0TED7LMa4HPA39rp6ztabsFdB6lA1NzaUTMi4i/Aw8BG9cd4w1Kzo2tKJ/FbXkuo/KvdstidJZNRNwH/AOohby+LiKey9c7AhflSMkTlJDc0OTn4cRYZma9o686D7ULClCSOFESIq3ewT4dJnECZtGWxOnSyj7X5a/o4RGxSUR8qW7fl4BXJa3b7DEjYg4wEvg9JQxyexfejpI41er0noi4tq78N/N8DqRcYMcCOwPrAc1MBH2j8rp67PrzeDAiTqe0/RY5L6Ez0c7rRu+h1H9HyujO85TcGLXOQ23koaP5Ja92dowWPg8zM+sFfdV5cBKnzpM4jQGOyn/HUuYi
TI0F44e/DKzQwvHJ436sMhdiA0on44UG5Y0nbwkB+wK3VtZ9VtISktaj3B6Z2eBQ44CvAnfl+2mUUYi1KZ1IKOe4b9Zrw1zXqKwxwN45b2QIpUNFNz8PMzPrpj6Z0e4kTk0lcRoLHAtMiIhXJb1Og1sWEfGspHGSZgB/BZqNfLQfpf1nA3OAfSNibk5OvFzSpyhteTjwO0lHZ10PrJQxk9KBWxM4JCJeb3Cc8ZSOxQlZ3zmS/gU8mrdVAH5NmTMxPetyQES80WCe5xWUybbTKZ9hrfO4Al3/PMzMrJucGMuaIuls4MqIuHxh16UrnBjLzKx1cmIsMzMz6wkOxGNNiYgDFnYdusO5LcystyyO+S888mBmZmYtcedhEaV+kktE0hYZ46L2fh9JsytPnwyTNC1f3yzpkcpTIUj6o6RX8vVQSa9lVMp7VaJ57t9MPczMrOe487AIyotvf8klMh1YJwNgQYn3cB8lKmbt/bjK9i+QETIlrQwMqSvvwYjYMiLem/U4Mp+QMTOzPuLOw6Kp3+QSycczbwdq+28NnMb8OTCqybIupi3OxGdoi9q5gIh4iBJ+/PCOzsvMzHqWOw+Lpn6TSySNB0ZlcKx5wM3M33mojjzcAOwoaVAe85JOqj6FujDZlXNybgszs17gzsNiYCHnEoG2bJsjgdszKdn6klYHBucIQs1cSlTLvYBlM3lXh6fX3grntjAz6x1+VHPRdDcl5wdQconkKEFHUZI6zCUiaRZtuUS2r+xzXUTs00l9JlIyio6mJMoCeIzSERnfYPuLKdElj++kXChzJ5qavGlmZj3DIw+Lpn6VSyQiXgYeBQ6grfMwgZKXolHnYSwlvPVFDda9RdJQ4GSgw7kcZmbWs9x5WARlMq3dgfdLeljSJMptiAVyidT+KL/ga3k9plPmJtTnEtmUnCiZx3ma0iG4KB+3nEg78w8oty6WzoyoUDoP69Kg8xDFyRHxTINy1qs9qknJpvqriDiro/YwM7Oe5dwWtlhwbgszs9Y5t4WZmZn1CHcezMzMrCV+2sIWC06MZda7FsfkUIszjzyYmZlZS/pF58FJnDqt11qSLq+cz25153xUE2UclGGnp0maIelTufwASWt1pV515UvSM5Lenu+HSHcQ6GwAACAASURBVApJoyvbPC1p1W4cY1bGqzAzs4VooXcenMSpcxHxRETsmW+HA7t1tH09Se8CjgVGR8TmlMiQ03L1AUC3Ow/5eOhttAWQGgXcmf8iaSPgmYh4tsk6+5aamVk/tdA7DziJE5KulrR5vr5T0vfy9Q8lfTlHMWZIehvwA2AvSVMl1TpKm+RIyEOSGiWJWgN4GXgl6/JKRDwsaU9gBHBBlrdso3bNusyS9JMcQZlUCSZVVQtDXWurnzF/Z2J8lrWOpBtyFOQGSWvn8rMl/UzSTcBPJK0q6dqsz/+SUTAlLS/pKpWQ2zMq7WBmZn2gP3QenMQJxgA7SFoRmEOOZlDCOY+tbRQR/wa+B1wSEcMjona8jYEPU3JHHFe7zVJxF/AU8LCksyR9Isu7nNIO+0bEcCBov10BXoqIkcCpwM8bnMd42tpqJGVE6d35vtp2pwLn5ijIBcAvK2VsSPmMvgkcB9waEVsCfwbWzm0+AjwREVtExGbA3xrUxYmxzMx6SX/oPMxHi2cSp7FZ/9HAVcBgScsBQ3M0pTNXRcQbGZHxX8Ca1ZURMZdywd0TuB84RdLxDcrpqF2hrS0vom1EoWoSsGV2vJaKiFeAh3KUojpqsz1tt5LOy/OuuSzrSx77/DyHq4Dnc/l0YNccCdkhIhr2DJwYy8ysd/SH+8pO4lRulYwAHgKuA1YDvgJMbqJMgDcqr+fS4HPNOQmTgEmSrgPOalDndtu1Vkw7r2vHmC3pAeAg2kaTJlLmaKwBtNcRqpb1agfrase5X9LWWe4Jkq6NiB90UnczM+sh/WHkYbFP4pS3Ix6lzM+YmGUeReWWRcXLwAoNlnd07LUkbVVZNBz4
R4PyOmvXvSr/TqCxcZS2qrbdfwIToy0W+nja5orsSxm9aWRMrkfSR4HakxxrAbMj4nxKm27Vzv5mZtYLFnrnwUmc3jIWeCoiZufrd9G483ATZYJkdcJkZ5YCTpZ0X96y2YtyQYcyx+GMXC46btelJd2W+x7ZzrHGUdqq1nmYkudSbbvDgQPzc9ivUpd636fMKZkCfAh4JJcPo4ygTKU8RfKjjk/fzMx6khNjWVPyVtCIdjpJ/Z4TY5mZtU5OjGVmZmY9oT9MmLQBICKGLuw6dIdzW5jZQDEQ8oR45MHMzMxa4s5DPyXpcJWcGBe0uN/Kkg7N18NyYuVUSc/lhNSpkq7vQn3WlbR3B+s3lvRXSX/Pel8saY0WjzFIUqNJomZm1o+489B/HQrsFhH7trjfyrkvETE9I1EOp0RoPDrf79qF+qxL2+OV85G0LHAl5WmSDTKXx/8BLSXBioi5EbFDF+pmZmZ9yJ2HfkjSGZSL9Z8lHSlppKTx+djneJUkU0jaNPNMTM08ERsAJ1IeEZ0q6aROjnNM7j9Nbfk0ts993yZpsKR7JL03y90519Xnz9iPktjs6tqCiLghIu5VyZdxjkqujCmSdszjDJN0e6Xu60paUtILuX5XlbwXf5A0U9K5lXpvI+kWlQysf5W0JmZm1mc8YbIfiohDJH0E2DkinlHJebFjRMyRtCvw35TomYcAv4iIC1SSZg0CjgE2y9GGdqmk9V6bkgRMwNWSRkXEeEl/oyTgejtwVnYCjgG+HhG7NyhuM9qPhnk4JfHZMEmb5nE2oIyOnBwRl6gk32oU3XIrSkjxfwETVXKV3An8Avhkts2+wA+Bgxuc48G15YNWXL2j5jAzsxa48zAwrASckxfdoAR9ghKI6ViVlNt/yARgzZb5IeCjlIsxwGBKUqrxlIRUk4GXmD8xVleMBk4CiIi7JT0BrJ/H+a6kdbLuD2jBNNwTI+JJgAwINRR4nRIA7Po810GU8OELiIgzgTMBlh6ygQOamJn1EN+2GBh+CNyUGSQ/ASwDEBEXAp8EXgOukfSBFsoU8KPanIiIWD8izs51q1FChK8ILN1EWXdT0pe3d5wFRMR5wKcpeTmuq93OqNMoZ4eAaZV6D4uIjzZRRzMz6yHuPAwMKwGP5+sDagslrQs8FBG/pEyI3Jzmc19cA3xJJQMmkt6lkpAMyq/1Yyhhvk/IZR2Vex4lvPhHKnXbTdImzJ+f4r3AEOABSetGxAMR8QtKJtHNm6gzwD3AOyWNzDLflrdDzMysj7jzMDD8DyV75DjKMH3NXsCMHNLfGDg3Ip6lpB2f0dGEyZzceDllLsF0Su6NwZIOAl6NiEuBH1OSib2fcntjkEq69MPryppNGRE5Mh/VvAf4AvA0JRHYsnmMC4AvZiKwz0u6O+u+Lpl6uzMR8QYltfjPJN2V9dq2mX3NzKxnOLeFLRac28LMrHXObWFmZmY9wp0HMzMza4kf1bTFghNjmdlA1F+TZHnkwczMzFrSY50HJ3LqXZI+LenofP0ZSRtX1t0qqbOIkoMknZZPYUzPsNTrSFoio0f2RB23lnRH5f1+kl6RNCjfbylpSjfKXz+fzjAzs4WoJ29bHAp8NCIebnG/WiKnX0fEdGA4gKSzgSsj4vIu1qeWyOni+hVqS+R0eC0fg6RdKImc/tXsASJiLtAniZwi4orK288A84D7Wiji85Tz2zwi5klamxJBcglKTIcTe6CadwHrS1ouH98cBdwPbAFMyffjmi1M0pIRMacH6mVmZj2oR0Ye5ERO3UrklOU8lK9XkzRP0qh8P0HSUElflvRzSTsAuwGnZF2GZjF7Z9vMrO1bZwjwZETMy/N9JCJeyHZaIcs6N4/5rRyhmCHpsFy2fsZlOC/b5tLshL0lL/RTgJG5aEvgdEqngfx3fJb3wTzmdEn/p5KbA0mPSfovlZgWn862myZpAiWXR63NFvg8GpyzmZn1gh7pPETEIcATlEROp1B+Ee8YEVsC36Mk
coK2RE7DgRGUnATHAA9mqOGj2zuG5k/kNBwYpZLIaQJQS+T0UzKRU5Z7U5b7y7rimkrkROlknJcXtloip+HANnm+9bYC/oOSzOm9krZTSfr0C2CPiNiaEgzph3XtNwd4KDtZo7NuO+TFeY2ImFXZdixwNXBknlttnSJiJHA0pc3rXQx8Jjt0J1ducxwDvJxlfVElcuO+lA7A9sChkmrRHzcBTsu2eR34aoPjjKd8NitQwkuPYf7OwzhJywG/yzYZRgmFXU1s9WpEvC8iLgPOBr4WEdszf4CsTj8PSQdLukPSHXNnv9igqmZm1hW9NWFyJeAySTOAUyiJjKAkcvqOpG8D60TEay2UWU3kNIWSXGnDXHcc8HFgGKUD0R2jKeGWiYi7KRelaiKnbwHvjojXG+w7MSKezNsZtURO76UtkdNUysX63Q32HQvsmH8nUG6HbAvc1mS9/5D/Ts7jziciHgE2Ao7NRTdJ2qlBOTsAv4+I2RHxMvBHSpsAPBwRE/P1+ZXlVeMonYTtgEkRMRPYSNI7gKWyHu8F/h4RD+Y+51LOu+YSKKMwwLIRUbvVcV5lm04/j4g4MyJGRMSIQcut1KCqZmbWFb3VeXAip9YTOY2lXLhHUOZjrEa5oI5p4nyqx64dt9E5vB4RV0fEUcBPgE812KyjtJz14UgbhSedQOn0vC9fA/wT+Cxt8x06S/35aifHaPbzMDOzXtCbIw9O5FQ0m8hpAvB+yi2TfwPTga9QOhX1mm2zt6g8CTEkXy9BGaX5R21CotrSYY+hzDVYVtJgSgejVof3SNomX+8D3Fp/nJxH8RTllk+t8zAROIKc70Bpkw0q8xS+ANzSoKxngNclbZ+L9q2cT1c/DzMz66be6jw4kVPbsZpK5JS3cJ6g7QI7ljKack+DYi+i3P6pTpjszDuAq/JW0nTK6M/pue63wDRJ50bEpCz/dspF//R8CgbKiM1XJE0Dlqd02hoZBwyKiCfz/QRKm43Pc50NfAn4Q7bzG8D/tVPWgcD/5oTJVyrLu/R5mJlZ9zkxljVF0vrA5TlBccBxYiwzs9bJibHMzMysJzi3hTUlIh4gA3gNRM5tYWZ9rb/mpegJHnkwMzOzlrjzMEBpAOUSkfQXSR+vvH9QlXwakv4k6ZMqUTpD0v6VddvksiPy/flZz7sk3a8SDXStVutrZmZd587DwHUosFtE7NvplvOr5RIhIqbX4k9QHp09Ot/v2oX61HKJNDKejDKpEpr7BUr0yprtaHvKZHpdOXtTcmZUHRkRW1Ce2JkO3ChpqS7U2czMusCdhwFIAy+XSC3qJPnvH4G1srwNgBcypgPAQ8CKKjk+BHyQEuNjARExLyJOBp6jRCA1M7M+4AmTA1BEHJIBrnaOiGckrUjJJTJH0q6UXCJ70JZL5AKV/ByDKMG0NuvskUvNn0tEwNUquUTGS6rlEnk7mUskb0N8PSJ2b1Dc7cDwDEQ1itIZ2ETShpQRiPpMm7+nxMa4lxKe+81OmmQKZRRivhmRkg4mc2YMWnH1ToowM7NmufOwaFgJOCd/xQdQG8KfABwr6V3AHyLi7+XHfFOquUQABlNyiYyn5BKZTEnp/bXOCoqI1yTNpDytsS0lfPkmlI7E9rTdsqi5hBIF9H5KwKrOwpi3F1L8TDKQ1dJDNnBAEzOzHuLbFouG/p5LBEoHYSdgmYh4iRK9clT+zTfyEBGP5/HfD9zcRNnDKaMUZmbWB9x5WDT091wiUDoIX6NtJONOSiKwd1BSuNf7L+DbETGvvQJVHAmsClzXxDmZmVkPcOdh0dCvc4mkcZRJnhOy/DeBZylpuxe4pRARt0bEn9up3imZJ6R2K+QDWZ6ZmfUB57awxYJzW5iZtc65LczMzKxHuPNgZmZmLfGjmrZYcGIss/5rUU4gtajyyIOZmZm1pE86D07i1P0kTpJ+LGnnfP0NScvk6yUlvdDE/kMkXZ11uUfSnztriy7U8ZuSTq68/21Go6y9P1LSz7pR
/pcl/by79TQzs+7pq5EHJ3HqZhKniDg2Im7Kt98gA0G14EfAVRGxRURsAnw3l3fUFq16q+3SMGBVSbXv2QIBodqTMRw8MmZm1g/1+v+c5SROQMdJnCSNknRpvt5D0quSlpK0vKS/5/LzJe2eQZHWAMZWR10knZijChMkrdGgCkOAxyr1mZYv52sLScvmCMl0SVMk7Zjlf1nSFZKukTRT0ncbHGMyJWfF0pJWoYSvnkEJRQ2VUNSSvqUSa2KGpMNy2fr5/gxKvoohedz7Jd1M6bjVznfv3PYuSTdhZmZ9ptcnTDqJ0wIaJXG6Hdg6X+8A3ANsRcknMbG6c0ScIumbwA4R8ULWcyXglog4Jm8LHETpFFSdClwoaQpwfbbFk5Q2fqstJH0b+HdEDJO0KaUtN8gyRgKbAf8Gbpd0ZURMrdTt35Jm5Lm8Pev+KDBK0stZ7pOSRgL7ZnmDgEmSbgFmUzoaB+b35l2USJNbUSJYjqm0x3HAThHxlKSVGzW0nBjLzKxXLIxh4ZWAy/IicwqwaS6fAHwnL17rRMRrLZRZTeI0BVifksQJykXm45Qh9J92VlAet5rEaVLWrZaHoVESp88B+1CSOHVmgSROGR3xkbxIjwB+DuxI6UiMbaLM1yLir/l6MjC0wTGuBtYDfku5QN8padUGZY2mJKUiIu4GnqC0J8A1EfF8RLxKGZEZ3WD/2sjNKEq7NWq7HYDfR8TsiHi5rqwHI+L2fL0dcENEPBsR/6ZEuawe51xJX6ad73FEnBkRIyJixKDlVmq0iZmZdcHC6Dw4iVPjJE5jgY9Rfn3fQLnAjqb82u7Mvyuv59LOiFJehC+IiC8AU2l88e8o7WZ9ONJG4Ulr8x62p7TbDMpoRbXtOjrGq00cA+ArlI7hUOAuSW/voEwzM+tBC2vkwUmcFjSGMhFyfET8M4+1XkQ0Ol6z7VI9/i6Sls3XKwLvAR5pUNYYyi0Fcn7IEOCBXPchlSdglgM+RePJj7WRh5WzszKPMun0Y7SNPIwBPp3zKwZnWY1GWCYCu0haJW9l7VlZt25ETKS0/fPAO5tvDTMz646FESTqf4BzJH0DuLGyfC/gC5LeBP4J/CAinpM0Lm9x/DUijm5UYERcLWljShInKBfEz0v6JJnEKecGTFBJ4jSBTOIE/DY7LFULJHGS9Cylc9MwiVMH53uKpO8Dy2Z57SVxmkC5UNdGGmZQnjZp5EzgekmPAh/p4NhV2wCnZvsuAZweEXfW5pfU2gL4FfC/Ksmw3gS+mHMZAG4FLqTc/jivOt+hJue1vAhMqyyeSJnfMD23mSTpIspcD7Iu0yWtX1fWY5J+lPs/AVSTU5wi6T2UUYxrI2JGk+1gZmbd5MRY1pScW7BZRByxsOvSFU6MZWbWOjkxlpmZmfUEjzzYYmHpIRvEkP0dnNLMBo7+kPPDIw9mZmbWI9x5GKAkraq2XB//lPR45X1UXk+VNLTB/mdL2jNf35xRI6dJuk/SqdXAS5LmNlHehiq5Mx5QyWNyqUp471bPqz6OhpmZ9TNOyT1ARcSzlJgRSDoeeCXDXyPplc6icjawb0TckU9fnAD8iRK7AkoQqnbLU0nSdRXwjYj4Sy7bGVgdeKqVSkTEqM63MjOzhckjDzafjOT4LWBtSVs0udvngQm1jkOWc1NEzJC0jKSzVHJl3Km2zKCNcpkg6ZX8d6ccEbk8R0MuUD4vKmlrSbdImqySa2NIT7aBmZl1zCMPi6ZlJdViMDwcEZ9uZeeImJtxHzamZAntrLzNKGGxG/mPLHNYxuK4ViVPSKNcJvW2pIQvf4ISe+N9km6jxKL4VEQ8LWkv4MeUfB7zcW4LM7Pe4c7DoqnD2wxNqoaQ7k55oykXeyLiPkn/oOQdmQAcm8mv/hARf2+w76SIeAwgOy9DKdEqNwOuy4GIQcCTjQ4cEWdSAmqx9JAN/FiRmVkP8W2LxUTeOpgq6eomth1ESSTWKAdHI3fTlhV0geIaLWwyl8kblde1
nB0C7q7kMRkWER9qsK+ZmfUSdx4WExFxYF5sd+toO0lLUSZMPhoR0zratuJCStrttx5KlvQRScOYP1fGhpTU6TPVOJdJM2YCq0vavlZfldThZmbWR9x5sJoLJE2j5NRYnpKsqimZxvzjwGGS/i7pHkrSs38Bv6bkzphOSV9+QES8QcllMiNvR2wMnNvksf5NSZD1k5yXMZWSiMvMzPqII0zaYsG5LczMWucIk2ZmZtYj3HkwMzOzlvhRTVssTH/8RYYec9XCroaZWZ/qreRaHnkwMzOzlvRZ58GJnLpH0iclHZOvd5e0SWXdzZIWmNBSt/8Skn4paUaGir5d0nty3Xd6qI5bVCJRImkfSbPz8U8kDcsnOrpa/lBJM3qirmZm1nV9dtvCiZy6JyL+TImHALA7cCVwTwtF7AWsBWweEfMysuOrue47wH/3QDWnA+tIWiEiXqY8QnkfJcz0pHw/rtnCJA2KiLk9UC8zM+tBA/62xaKQyEnSIEkPqVhZ0jxJO+a6sZLWl3RAjrCMokRmPCnrsl4W89ms3/2SdmhwzkOAJyNiXp7vYxHxvKQTydwVki7IY34jRyhmSDoilw3N8zonz/9yScvVfRbzgNuBbXPR1sBptMVhGAWMz/J2yfadLul3kpbO5bMkfU/SrXlOW0u6S9IEMk9GR5+HmZn1vv7SeahdvKZKuqLVnfPXaS2RUzPlNZXICdgHOCdHKmqJnIYDI4DHGuy7JXAEsAmwLiWR01KU3A57RsTWwO8oiZzq639/7jc667ZDXlDfFREPVLYdTxmBODojRj6Yq5aMiJF5/OMa1O1S4BPZJj+VtGWWdww5UhMR+0raGjiQ0gHYDvhKbVtgI+DMiNgceAk4tMFxxlOiTS4PzANuZv7Ow7hsz7OBvbKdlwS+Vinj9YgYHREXA2cBh0fE9nXH6fTzkHSwpDsk3TF39osNqmpmZl3RXzoPtYvX8FYzQFYskMipi+WNBs6DksgJqCZy+o6kbwPrZFTFepPyF/08SuTDoZQLbi2R01Tgu8C7Guw7Ftgx/07IemxD+SXfjD/kv5PzuPPJBFMbAf+PclG/QdIuDcoZDVwREa9GxCtZbm0k49GIqN12OD+3rTeO0kkYCdyenZv1Ja0ODI6Ih7IeD0fE/bnPOXneNZcASFoJWDkibsnl51W26fTziIgzI2JERIwYtNxKDapqZmZd0V86DwvQ4pfIaSzlIj0SuBpYGdiJkhuiGbVj147b6BzeiIi/RsTRlDkOuzfYrOH514ro5D3AREqnZzTlAg9lVGBv8pZFJ8eAtrkYaucYzX4eZmbWC/pt52ExTOR0G+UX+7yIeJ0ycvFVSqei3svACk0enzzuVpLWytdLZP3/kavfzHaEcv67S1oubz18ulKHtWvnQbmlc2v9cXKi5KOU3Ba1zsMEyu2UWufhPmCopPXz/X7ALdSJiBeAFyXVRjj2rZxPVz8PMzPrpn7beWjCIpXIKY/xKOWXO5QL9gqUJxjqXQwcnRMO12uwvpE1gL+oPOo4DZgDnJrrzgSmSbogIqZQ5iNMonRofhMRd+Z29wL7Z7uvApzezrHGAUtHxKP5fgJlDsj4PNfXKfMqLst2ngec0U5ZBwKn5YTJ6q2JLn0eZmbWfU6MZU1RiZVxZURstpCr0iVOjGVm1jo5MZaZmZn1BOe2sKZExCzKUyMDknNbmA0cvZWPwXqORx7MzMysJe489DNqy8txd0ZW/EY+HdGVsi5QyQEyI6M4LtX5Xl2TT5GcmBNQZ2T0x492oZwfSNq1N+poZmY9w52H/qcW4GpT4IPAbjSOGNmMCyhPIgwDlgW+3DNVbOiHlBDYm+Wkyk/Q4uOkABHxvYi4vqcrZ2ZmPcedh34sIv4FHAx8XUV7eTcGSTo5l0+TdFjuf3UkyqOX71LJrjlL82chfUDSmpJWl/R7lYybt0t6X64fXDnuNEl7VOupkuPiK8Bh+cgpEfFURFya6/fJfWdI+kmlzmerLcvn
kbm8mj11lqTvS5qS22ycy5fPkZTbsx2afkzXzMy6zxMm+7mIeChvW6wBfCGXDcsL6bUZyOpA4D3AlhExR9Iq1TLydsV+wH9mRs0/UYI/nSVpW2BWRDwl6ULglIi4VdLawDXAe4H/Al7MPBRIentdNdcHHomIl+rrn4GpfkKJ6Pl81nl3SkyLd9Ye/ax2Zuo8ExFbSToUOIoyenIscGNEHJT7TZJ0fUS8Wt1R0sGUzheDVly93TY2M7PWeORhYKiFc24v78auwBkRMSfXPVe3/6+BMRFRixR5CSXIEpSw0Zfk612BUzPw0p+BFSWtkMtPqxUWEc+3UPdtgJsj4ums3wWUPBYPAetK+pWkj1ASbTXSKGfHh4Bjsp43A8tQIoHOx7ktzMx6h0ce+rkMwzyXEv2yvZwQ7eaAkHQcsDol1HXNBNqSVe0O/CiXLwFsX59kSlK75acHKKGrV8jw1PV1W0CmA98C+DAlk+nngIMabNooZ4eAPSJiZgd1MjOzXuKRh34sL+5nAKfmvIWGeTeAa4FDJC2Z61bJf79MuTjvk5k+AciyrgB+BtwbEc/mqmuBr1eOP7yd5fPdtoiI2cBvgV9KeltuM0TSFyghrt8vaTWVBGb7ALdIWg1YIiJ+T7ktslULTXMNJbS48lhbdrK9mZn1IHce+p9la49qAtdTLtzfz3Xt5d34DfAIJT/FXcDnc/szgDWBCVnm9yrHuYQyh+KSyrLDgRE5KfIe4JBc/iPg7Tm58S5g5wb1/i7wNHCPSv6MPwJPR8STlDTgNwF3AVMi4k/AO4Gb89bD2blNs34ILJXnOyPfm5lZH3FuC1ssOLeFmVnr5NwWZmZm1hPceTAzM7OW+GkLWyw4MZZZ85yYyjrjkQczMzNrSbc6D3ISpz5L4iRpfP47VNLnK8sPkHRqE/t/PEM53yXpHklfzeW7S9qkh+p4Z+3xTklLSno1H9esrZ8sqZVHMuvLv1nSAhN3zMysb3V35MFJnPooiVNEjMqXQ2l7FLMp2RE7E/hERGwBbEmJzAglSFSPdB6A8UCtnltQYlCMyjosD6xLeVyzmTr7lpqZWT/VY7ctnMSp60mcJP1a0ifz9RWSfpevvyTpR/n6ldz8RGCHHPE5MpetJelvOZLyPw0+nhUo81uezfN9IyJmShoFfBI4KctbT9JwSROz7a5QBoTKX/0/lzQ+22Fkg+OMo63zMIoSZ6IWaGokJcbDXEmrSPpjHmOipM3zGMdLOlPStcC5kpaVdHFudwmlU9nu52FmZn2jR+c8RMRDWeYalJDDZDKlfYBzJC1D6WDUkjhtThlxeIvakjj9LaMi1pI4oUoSJ+AXlCRO2wB7UAIlQSWJU5Z/Y101m0ni9AHKRW8blSROw8kkTnk+Z7XTBM9ExFbA6ZQkTtCWxGkbSnClk/JXeNUYYId8/U7aRgJGA2Prtj0GGJsjPqfksuGUXBXDgL0kvbu6Q+a6+DPwD0kXSdpX0hIRMT6XH53lPQicC3w72246848kLZ8jIIcCv2tw/tWRh1F5Xm+o5McYRelcQAl6dWce4zt5zJqtgU9FxOeBrwGzc7sf57ra+Xb6eUg6WNIdku6YO/vFRpuYmVkX9MaESSdxaj2J01jKaMImwD3AU5KGANtTLsiduSEiXoyI13P/deo3iIgvA7tQRnWOosHFX9JKwMoRcUsuOody/jUXZVljKO09XybMiJgFvE3SOyi3oGYCtwPbUjoPtXOpfjduBFbNYwP8uZJbY0fg/NxuGjAtlzf1eTgxlplZ7+jR+8pyEqcuJXGKiMfz9sBHKL/WV8ljvNKgjo28UXldPXb9caYD0yWdBzwMHNBE2fMV0cl7KJ/XnsCTERGSJgLvo9y2mJjbNGrnWlmvtrO8bUHzn4eZmfWCHht5kJM4tafZJE4TgCMo7TaWMjpQf8sC4GVanNyZ80B2qiwaThkJmq+8iHgReF5S7RbKfsAtlf32yvJGU24NNboXMA44Ms+ndl5fBP4Z
ES/ksup3YyfK7Z5GowfV7TYDanMjuvN5mJlZN3W38+AkTp1rNonTWGDJiHgAmEIZfWjUeZgGzFF5KKDjQAAACWlJREFU5LLZiYICvqXyKOxUymd0QK67GDg6J3OuB+xPmZcxjdLJ+EGlnOdVHhk9A/hSO8caR3mqYgJAtukg5r/9cjz52VEmgO7fTlmnA4Nzu29RbrlA9z4PMzPrJifGsqZIuhk4KiIGZHYpJ8YyM2udnBjLzMzMeoID8VhTImKnhV0HMzPrHzzyYGZmZi1x58HMzMxa4s6DmZmZtcSdBzMzM2uJOw9mZmbWEncezMzMrCXuPJiZmVlLHGHSFguSXqbkVhkoVgOeWdiVaJHr3Ddc59430OoLvVfndSJi9fqFDhJli4uZjUKs9leS7hhI9QXXua+4zr1voNUX+r7Ovm1hZmZmLXHnwczMzFrizoMtLs5c2BVo0UCrL7jOfcV17n0Drb7Qx3X2hEkzMzNriUcezMzMrCXuPJiZmVlL3HmwRYakj0iaKekBScc0WL+0pEty/W2ShvZ9LReoU2d13lHSFElzJO25MOpYr4k6f0PSPZKmSbpB0joLo551deqszodImi5pqqRbJW2yMOpZV6cO61zZbk9JIWmhPlrYRBsfIOnpbOOpkr68MOpZV6dO21jS5/L7fLekC/u6jg3q01k7n1Jp4//f3v3HyFHWcRx/f9pC6A8EpGJEGg/wqFpS2vRqaUQFa6IGcsVYYxtIqBESFWwMxh9o1aLhDyQGo0Co1FoqxCpVoYpaf7SlarzSQulPora1lUYTbBWwtNByfPxjnqvrdvd2xrqze9vvK7lkd/aZvc/M7e0+88yz8/2jpGeaEsR2/MTPkP8BhgM7gfOAk4FNwJuq2nwUuDvdng18bwhk7gImAkuBWUNkP18GjEq3PzJE9vMrKm73Aj9v98yp3anAWqAP6GnnvMBc4I5W7tf/IXM3sBE4I90/q90zV7X/GLC4GVli5CF0ijcDO2zvsn0YWAbMrGozE7g33V4OzJCkEjNWa5jZ9m7bm4GXWxGwhjyZV9s+mO72AeeUnLFanszPVdwdDbR6Jnme1zPAl4GvAC+UGa6GvHnbSZ7M1wF32v4ngO2nS85Yreh+ngN8txlBovMQOsVrgacq7u9Ny2q2sf0S8CxwZinpasuTud0Uzfwh4GdNTdRYrsySrpe0k+zDeF5J2eppmFnSZGCc7Z+UGayOvK+L96XTWcsljSsnWl15Ml8AXCDpd5L6JL27tHS15f7/S6cLzwVWNSNIdB5Cp6g1glB99JinTZnaLU8euTNLuhroAW5raqLGcmW2faft84FPA/Obnmpwg2aWNAy4HfhEaYkGl2cf/xjosj0R+BX/GQVslTyZR5CduriU7Ch+kaTTm5xrMEXeM2YDy233NyNIdB5Cp9gLVB7JnAP8tV4bSSOA04B/lJKutjyZ202uzJLeCXwO6LX9YknZ6im6n5cBVzY1UWONMp8KXAiskbQbuBhY0cJJkw33se39Fa+Fe4ApJWWrJ+97xkO2j9j+M1lxve6S8tVS5LU8myadsoDoPITOsR7olnSupJPJ/nFWVLVZAVyTbs8CVjnNKmqRPJnbTcPMaTh9IVnHodXniCFf5soPhMuBP5WYr5ZBM9t+1vZY2122u8jmlvTa3tCauLn28Wsq7vYCT5aYr5Y8/38Pkk0ARtJYstMYu0pN+d9yvWdIGg+cAfy+WUGi8xA6QprDcAOwkuxN6fu2t0n6kqTe1OxbwJmSdgA3AnW//laGPJklTZW0F3g/sFDSttYlzr2fbwPGAA+kr4u1tEOUM/MN6at4T5C9Nq6p83SlyJm5beTMOy/t401kc0rmtiZtJmfmlcB+SduB1cAnbe9vTeJCr4s5wLJmHhzF5alDCCGEUEiMPIQQQgihkOg8hBBCCKGQ6DyEEEIIoZDoPIQQQgihkOg8hBBCCKGQ6DyEEIY0Sf3pK6FbJT0gaVSrM9UjaYykhZJ2pq8t
rpU0rUm/a4kaVGJNlS7Prri/qB0qiob2F52HEMJQd8j2JNsXAoeBD+ddUdLw5sWqaRHZVU27bU8gu9bB2DwrKjOsatnx5p8LHO082L7W9vbjfM5wAojOQwihk/wGeD1ktTUkPZpGJRYOfNBKOpAuqrMOmC7pC5LWp5GLbw5UWpU0T9L2VMhpWVr2SkkPpmV9kiam5QskLZa0RtIuSccU1pJ0PjANmG/7ZYBUHfHh9PiNKcNWSR9Py7okPSnpLuBxYFyN/FMkPSLpMUkrq67kOPC7j9nGNCrRA9yf9tHIlL8nrTNH0pa0zq0Vz3VA0i2SNqV98Or/y18uDCnReQghdARl9UreA2yR9EbgA8BbbE8C+oGrUtPRwFbb02z/FrjD9tQ0cjESuCK1+wwwORVyGhjNuBnYmJZ9FlhaEeENwLvIyiZ/UdJJVREnAE/UKlQkaQrwQbLOxcXAdcou8w0wHlhqe7LtPZX5gXXAN4BZtqcAi4FbauyeY7bR9nJgA3BVGrk5VJHnbOBW4B3AJGCqpIF6H6OBPtsXAWvJylaHE0x0HkIIQ93IdFnpDcBfyC5DPoOs8NL69NgM4LzUvh/4QcX6l0laJ2kL2YflhLR8M9lR+dXAS2nZJcB3AGyvIrvc+WnpsYdtv2h7H/A0UOSI/BLgR7aft30A+CHw1vTYHtt9FW0r848nK5D1y7Sd88mKJVWrt431TAXW2P57uiTy/cDb0mOHgYEy4I8BXTm3MXSQEa0OEEIIx+lQGl04Kp16uNf2TTXavzBw9C/pFOAuoMf2U5IWAKekdpeTfWD2Ap+XNIHBSyJXVg/t59j3123ARZKGDZy2qIw8yPY9Xy9/Wm+b7en1Vm6wjXVXG+SxIxU1E2ptZzgBxMhDCKET/RqYJeksODpX4XU12g18iO6TNIas2ippYuI426uBTwGnkxX7Wks6/SHpUmCf7efyBLK9k2x05OaKeRXdkmam571S0ihJo4H3ks3faOQPwKskTU/Pd1Lq5DTcxuRfZOW9q60D3i5pbJorMgd4JM92hhND9BhDCB3H9nZJ84FfpI7AEeB6YE9Vu2ck3QNsAXaTlTwGGA7cl05JCLg9tV0AfFvSZuAgxatvXgt8Fdgh6SCwn6xS4+OSlgCPpnaLbG+U1NVgOw+niY9fT1lHAF8jG+VotI0AS4C7JR0Cples8zdJN5FVkhTwU9sPFdzW0MGiqmYIIYQQConTFiGEEEIoJDoPIYQQQigkOg8hhBBCKCQ6DyGEEEIoJDoPIYQQQigkOg8hhBBCKCQ6DyGEEEIo5N91/SlE9JWKHwAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "fig, ax = plt.subplots()\n",
    "\n",
    "# Baseline names and their Pearson correlation scores, read from the same dict so labels and bars stay aligned\n",
    "x = list(results.keys())\n",
    "x_pos = np.arange(len(x))\n",
    "y = list(results.values())\n",
    "\n",
    "ax.barh(x_pos, y, align='center')\n",
    "ax.set_yticks(x_pos)\n",
    "ax.set_yticklabels(x)\n",
    "ax.invert_yaxis()  # list the first baseline at the top\n",
    "ax.set_xlabel('Pearson Correlation')\n",
    "ax.set_title('Performance of Baseline Models')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove the temporary directory and the data stored in it\n",
    "tmp_dir.cleanup()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/papermill.record+json": {
       "results": {
        "Doc2vec Cosine": 0.528387685928394,
        "Doc2vec Cosine with Stop Words": 0.45572884639905675,
        "GLoVe Cosine": 0.6688056947022161,
        "GLoVe Cosine with Stop Words": 0.6049380247374541,
        "GLoVe WMD": 0.6267300417407605,
        "GLoVe WMD with Stop Words": 0.48470008225931194,
        "TF-IDF Cosine": 0.6749213786510483,
        "TF-IDF Cosine with Stop Words": 0.7118087132257667,
        "Word2vec Cosine": 0.6476606845766778,
        "Word2vec Cosine with Stop Words": 0.6683808069062863,
        "Word2vec WMD": 0.6574175839579567,
        "Word2vec WMD with Stop Words": 0.5689438215886101,
        "fastText Cosine": 0.6707510007525627,
        "fastText Cosine with Stop Words": 0.6771300330824099,
        "fastText WMD": 0.6394958913339955,
        "fastText WMD with Stop Words": 0.5177829727556036
       }
      }
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Record results with scrapbook for tests\n",
    "sb.glue(\"results\", results)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
